MACHINE LEARNING WITH CALIBRATION TECHNIQUE TO DETECT FRAUDULENT CREDIT CARD TRANSACTIONS

Goal:

Predict the probability of an online credit card transaction being fraudulent, based on different properties of the transactions.

1. Setup Environment

The goal of this section is to:

  • Import all the packages
  • Set the options for data visualizations
In [1]:
# Data Manipulation
import numpy as np 
import pandas as pd 

# Data Visualization
import seaborn as sns 
import matplotlib.pyplot as plt
import matplotlib.lines as mlines

# Time
import time
import datetime

# Machine Learning
from   sklearn.preprocessing import LabelEncoder, minmax_scale
from   sklearn.ensemble import RandomForestClassifier
from   sklearn.decomposition import PCA
from   sklearn.model_selection import train_test_split, GridSearchCV
from   sklearn.metrics import confusion_matrix , classification_report, accuracy_score, roc_auc_score, plot_roc_curve, precision_recall_curve, plot_precision_recall_curve
from   sklearn.calibration import calibration_curve
from   sklearn.calibration import CalibratedClassifierCV

from   xgboost import XGBClassifier
from   lightgbm import LGBMClassifier

from   imblearn.over_sampling import RandomOverSampler
from   scipy.stats import chi2_contingency,  f_oneway

import gc
import warnings
from   tqdm import tqdm


# Set Options
pd.set_option('display.max_rows', 800)
pd.set_option('display.max_columns', 500)
%matplotlib inline
warnings.filterwarnings("ignore")

2. Data Overview

Purpose is to:

  1. Load the datasets
  2. Explore the features

The data is broken into two files identity and transaction, which are joined by “TransactionID”.

Note: Not all transactions have corresponding identity information.

Load the transaction and identity datasets using pd.read_csv()

In [3]:
%%time
# Load Data
df_id   = pd.read_csv('Data/train_identity.csv')
df_tran = pd.read_csv('Data/train_transaction.csv')
Wall time: 21.3 s
In [4]:
# Identitiy Data
df_id.sample(6)
Out[4]:
TransactionID id_01 id_02 id_03 id_04 id_05 id_06 id_07 id_08 id_09 id_10 id_11 id_12 id_13 id_14 id_15 id_16 id_17 id_18 id_19 id_20 id_21 id_22 id_23 id_24 id_25 id_26 id_27 id_28 id_29 id_30 id_31 id_32 id_33 id_34 id_35 id_36 id_37 id_38 DeviceType DeviceInfo
601 2990046 -5.0 821814.0 NaN NaN 0.0 0.0 NaN NaN NaN NaN 100.0 NotFound 49.0 -300.0 New NotFound 102.0 15.0 410.0 360.0 NaN NaN NaN NaN NaN NaN NaN New NotFound iOS 11.1.2 mobile safari 11.0 32.0 2208x1242 match_status:1 T F F F mobile iOS Device
29589 3065170 -10.0 103219.0 0.0 0.0 2.0 -5.0 NaN NaN 0.0 0.0 100.0 Found 52.0 NaN Found Found 166.0 13.0 216.0 214.0 NaN NaN NaN NaN NaN NaN NaN Found Found NaN ie 11.0 for desktop NaN NaN NaN F F T T desktop rv:11.0
97539 3342636 -20.0 175191.0 NaN NaN 0.0 0.0 NaN NaN NaN NaN 100.0 NotFound 33.0 NaN New NotFound 225.0 NaN 266.0 305.0 NaN NaN NaN NaN NaN NaN NaN New NotFound NaN samsung browser generic NaN NaN NaN F F T F mobile SAMSUNG SM-G610M Build/NRD90M
141317 3561291 -5.0 135838.0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 100.0 NotFound 52.0 -480.0 Found Found 166.0 NaN 193.0 222.0 NaN NaN NaN NaN NaN NaN NaN Found Found Windows 10 chrome 66.0 24.0 1920x1080 match_status:2 T F T F desktop Windows
65961 3161995 -5.0 22457.0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 100.0 NotFound 49.0 -360.0 Found Found 166.0 NaN 193.0 333.0 NaN NaN NaN NaN NaN NaN NaN Found Found Mac OS X 10_13_2 chrome 63.0 24.0 2560x1600 match_status:2 T F T F desktop MacOS
138982 3549519 -10.0 417370.0 NaN NaN 3.0 -23.0 NaN NaN NaN NaN 100.0 NotFound 27.0 NaN New NotFound 225.0 15.0 290.0 127.0 NaN NaN NaN NaN NaN NaN NaN New NotFound NaN mobile safari 11.0 NaN NaN NaN F F T F mobile NaN

Identity Data Description

Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions. They're collected by Vesta’s fraud protection system and digital security partners. (The field names are masked and pairwise dictionary will not be provided for privacy protection and contract agreement)

Categorical Features:

  • DeviceType
  • DeviceInfo
  • id_12 - id_38
In [5]:
# Transaction Data
df_tran.head()
Out[5]:
TransactionID isFraud TransactionDT TransactionAmt ProductCD card1 card2 card3 card4 card5 card6 addr1 addr2 dist1 dist2 P_emaildomain R_emaildomain C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15 M1 M2 M3 M4 M5 M6 M7 M8 M9 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71 V72 V73 V74 V75 V76 V77 V78 V79 V80 V81 V82 V83 V84 V85 V86 V87 V88 V89 V90 V91 V92 V93 V94 V95 V96 V97 V98 V99 V100 V101 V102 V103 V104 V105 V106 V107 V108 V109 V110 V111 V112 V113 V114 V115 V116 V117 V118 V119 V120 V121 V122 V123 V124 V125 V126 V127 V128 V129 V130 V131 V132 V133 V134 V135 V136 V137 V138 V139 V140 V141 V142 V143 V144 V145 V146 V147 V148 V149 V150 V151 V152 V153 V154 V155 V156 V157 V158 V159 V160 V161 V162 V163 V164 V165 V166 V167 V168 V169 V170 V171 V172 V173 V174 V175 V176 V177 V178 V179 V180 V181 V182 V183 V184 V185 V186 V187 V188 V189 V190 V191 V192 V193 V194 V195 V196 V197 V198 V199 V200 V201 V202 V203 V204 V205 V206 V207 V208 V209 V210 V211 V212 V213 V214 V215 V216 V217 V218 V219 V220 V221 V222 V223 V224 V225 V226 V227 V228 V229 V230 V231 V232 V233 V234 V235 V236 V237 V238 V239 V240 V241 V242 V243 V244 V245 V246 V247 V248 V249 V250 V251 V252 V253 V254 V255 V256 V257 V258 V259 V260 V261 V262 V263 V264 V265 V266 V267 V268 V269 V270 V271 V272 V273 V274 V275 V276 V277 V278 V279 V280 V281 V282 V283 V284 V285 V286 V287 V288 V289 V290 V291 V292 V293 V294 V295 V296 V297 V298 V299 V300 V301 V302 V303 V304 V305 V306 V307 V308 V309 V310 V311 V312 V313 V314 V315 V316 V317 V318 V319 V320 V321 V322 V323 V324 V325 V326 V327 V328 V329 V330 V331 V332 V333 V334 V335 V336 V337 V338 V339
0 2987000 0 86400 68.5 W 13926 NaN 150.0 discover 142.0 credit 315.0 87.0 19.0 NaN NaN NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 2.0 0.0 1.0 1.0 14.0 NaN 13.0 NaN NaN NaN NaN NaN NaN 13.0 13.0 NaN NaN NaN 0.0 T T T M2 F T NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 117.0 0.0 0.0 0.0 0.0 0.0 117.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 117.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 117.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2987001 0 86401 29.0 W 2755 404.0 150.0 mastercard 102.0 credit 325.0 87.0 NaN NaN gmail.com NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN NaN M0 T T NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2987002 0 86469 59.0 W 4663 490.0 150.0 visa 166.0 debit 330.0 87.0 287.0 NaN outlook.com NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN NaN 0.0 315.0 NaN NaN NaN 315.0 T T T M0 F F F F F 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2987003 0 86499 50.0 W 18132 567.0 150.0 mastercard 117.0 debit 476.0 87.0 NaN NaN yahoo.com NaN 2.0 5.0 0.0 0.0 0.0 4.0 0.0 0.0 1.0 0.0 1.0 0.0 25.0 1.0 112.0 112.0 0.0 94.0 0.0 NaN NaN NaN NaN 84.0 NaN NaN NaN NaN 111.0 NaN NaN NaN M0 T F NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 48.0 28.0 0.0 10.0 4.0 1.0 38.0 24.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 50.0 1758.0 925.0 0.0 354.0 135.0 50.0 1404.0 790.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 28.0 0.0 0.0 0.0 0.0 10.0 0.0 4.0 0.0 0.0 1.0 1.0 1.0 1.0 38.0 24.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 50.0 1758.0 925.0 0.0 354.0 0.0 135.0 0.0 0.0 0.0 50.0 1404.0 790.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2987004 0 86506 50.0 H 4497 514.0 150.0 mastercard 102.0 credit 420.0 87.0 NaN NaN gmail.com NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 18.0 140.0 0.0 0.0 0.0 0.0 1803.0 49.0 64.0 0.0 0.0 0.0 0.0 0.0 0.0 15557.990234 169690.796875 0.0 0.0 0.0 515.0 5155.0 2840.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Transaction Data Description

  • TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
  • TransactionAMT: transaction payment amount in USD
  • ProductCD: product code, the product for each transaction
  • card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
  • addr: address
  • dist: distance
  • P_ and (R__) emaildomain: purchaser and recipient email domain
  • C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
  • D1-D15: timedelta, such as days between previous transaction, etc.
  • M1-M9: match, such as names on card and address, etc.
  • Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

3. Optimize Memory Used by Data

There are few features which has data that needs smaller memory size to hold it but the current data type is occupying more memory. Hence reducing memory usage by the data is very much necessary. This section gives a function to modify the data type of each feature.

Memory occupied by the dataframe (in mb)

In [4]:
df_id.memory_usage(deep=True).sum() / 1024**2  
Out[4]:
157.63398933410645
In [5]:
df_tran.memory_usage(deep=True).sum() / 1024**2
Out[5]:
2100.701406478882
In [6]:
df_tran.dtypes
Out[6]:
TransactionID       int64
isFraud             int64
TransactionDT       int64
TransactionAmt    float64
ProductCD          object
card1               int64
card2             float64
card3             float64
card4              object
card5             float64
card6              object
addr1             float64
addr2             float64
dist1             float64
dist2             float64
P_emaildomain      object
R_emaildomain      object
C1                float64
C2                float64
C3                float64
C4                float64
C5                float64
C6                float64
C7                float64
C8                float64
C9                float64
C10               float64
C11               float64
C12               float64
C13               float64
C14               float64
D1                float64
D2                float64
D3                float64
D4                float64
D5                float64
D6                float64
D7                float64
D8                float64
D9                float64
D10               float64
D11               float64
D12               float64
D13               float64
D14               float64
D15               float64
M1                 object
M2                 object
M3                 object
M4                 object
M5                 object
M6                 object
M7                 object
M8                 object
M9                 object
V1                float64
V2                float64
V3                float64
V4                float64
V5                float64
V6                float64
V7                float64
V8                float64
V9                float64
V10               float64
V11               float64
V12               float64
V13               float64
V14               float64
V15               float64
V16               float64
V17               float64
V18               float64
V19               float64
V20               float64
V21               float64
V22               float64
V23               float64
V24               float64
V25               float64
V26               float64
V27               float64
V28               float64
V29               float64
V30               float64
V31               float64
V32               float64
V33               float64
V34               float64
V35               float64
V36               float64
V37               float64
V38               float64
V39               float64
V40               float64
V41               float64
V42               float64
V43               float64
V44               float64
V45               float64
V46               float64
V47               float64
V48               float64
V49               float64
V50               float64
V51               float64
V52               float64
V53               float64
V54               float64
V55               float64
V56               float64
V57               float64
V58               float64
V59               float64
V60               float64
V61               float64
V62               float64
V63               float64
V64               float64
V65               float64
V66               float64
V67               float64
V68               float64
V69               float64
V70               float64
V71               float64
V72               float64
V73               float64
V74               float64
V75               float64
V76               float64
V77               float64
V78               float64
V79               float64
V80               float64
V81               float64
V82               float64
V83               float64
V84               float64
V85               float64
V86               float64
V87               float64
V88               float64
V89               float64
V90               float64
V91               float64
V92               float64
V93               float64
V94               float64
V95               float64
V96               float64
V97               float64
V98               float64
V99               float64
V100              float64
V101              float64
V102              float64
V103              float64
V104              float64
V105              float64
V106              float64
V107              float64
V108              float64
V109              float64
V110              float64
V111              float64
V112              float64
V113              float64
V114              float64
V115              float64
V116              float64
V117              float64
V118              float64
V119              float64
V120              float64
V121              float64
V122              float64
V123              float64
V124              float64
V125              float64
V126              float64
V127              float64
V128              float64
V129              float64
V130              float64
V131              float64
V132              float64
V133              float64
V134              float64
V135              float64
V136              float64
V137              float64
V138              float64
V139              float64
V140              float64
V141              float64
V142              float64
V143              float64
V144              float64
V145              float64
V146              float64
V147              float64
V148              float64
V149              float64
V150              float64
V151              float64
V152              float64
V153              float64
V154              float64
V155              float64
V156              float64
V157              float64
V158              float64
V159              float64
V160              float64
V161              float64
V162              float64
V163              float64
V164              float64
V165              float64
V166              float64
V167              float64
V168              float64
V169              float64
V170              float64
V171              float64
V172              float64
V173              float64
V174              float64
V175              float64
V176              float64
V177              float64
V178              float64
V179              float64
V180              float64
V181              float64
V182              float64
V183              float64
V184              float64
V185              float64
V186              float64
V187              float64
V188              float64
V189              float64
V190              float64
V191              float64
V192              float64
V193              float64
V194              float64
V195              float64
V196              float64
V197              float64
V198              float64
V199              float64
V200              float64
V201              float64
V202              float64
V203              float64
V204              float64
V205              float64
V206              float64
V207              float64
V208              float64
V209              float64
V210              float64
V211              float64
V212              float64
V213              float64
V214              float64
V215              float64
V216              float64
V217              float64
V218              float64
V219              float64
V220              float64
V221              float64
V222              float64
V223              float64
V224              float64
V225              float64
V226              float64
V227              float64
V228              float64
V229              float64
V230              float64
V231              float64
V232              float64
V233              float64
V234              float64
V235              float64
V236              float64
V237              float64
V238              float64
V239              float64
V240              float64
V241              float64
V242              float64
V243              float64
V244              float64
V245              float64
V246              float64
V247              float64
V248              float64
V249              float64
V250              float64
V251              float64
V252              float64
V253              float64
V254              float64
V255              float64
V256              float64
V257              float64
V258              float64
V259              float64
V260              float64
V261              float64
V262              float64
V263              float64
V264              float64
V265              float64
V266              float64
V267              float64
V268              float64
V269              float64
V270              float64
V271              float64
V272              float64
V273              float64
V274              float64
V275              float64
V276              float64
V277              float64
V278              float64
V279              float64
V280              float64
V281              float64
V282              float64
V283              float64
V284              float64
V285              float64
V286              float64
V287              float64
V288              float64
V289              float64
V290              float64
V291              float64
V292              float64
V293              float64
V294              float64
V295              float64
V296              float64
V297              float64
V298              float64
V299              float64
V300              float64
V301              float64
V302              float64
V303              float64
V304              float64
V305              float64
V306              float64
V307              float64
V308              float64
V309              float64
V310              float64
V311              float64
V312              float64
V313              float64
V314              float64
V315              float64
V316              float64
V317              float64
V318              float64
V319              float64
V320              float64
V321              float64
V322              float64
V323              float64
V324              float64
V325              float64
V326              float64
V327              float64
V328              float64
V329              float64
V330              float64
V331              float64
V332              float64
V333              float64
V334              float64
V335              float64
V336              float64
V337              float64
V338              float64
V339              float64
dtype: object

Certain features occupy more memory than what is needed to store them. Reducing the memory usage by changing data type will speed up the computations.

Let's create a function for that:

  • int8 / uint8 : consumes 1 byte of memory, range between -128/127 or 0/255
  • bool : consumes 1 byte, true or false
  • float16 / int16 / uint16: consumes 2 bytes of memory, range between -32768 and 32767 or 0/65535
  • float32 / int32 / uint32 : consumes 4 bytes of memory, range between -2147483648 and 2147483647
  • float64 / int64 / uint64: consumes 8 bytes of memory
In [7]:
print('int64 min: ', np.iinfo(np.int64).min)
print('int64 max: ', np.iinfo(np.int64).max)
int64 min:  -9223372036854775808
int64 max:  9223372036854775807
In [8]:
print('int8 min: ', np.iinfo(np.int8).min)
print('int8 max: ', np.iinfo(np.int8).max)
int8 min:  -128
int8 max:  127
In [9]:
# Reduce memory usage
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage(deep=True).sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

Use the defined function to reduce the memory usage

In [10]:
# Reduce the memory size of the dataframe
df_id   = reduce_mem_usage(df_id)
df_tran = reduce_mem_usage(df_tran)
Mem. usage decreased to 138.38 Mb (12.2% reduction)
Mem. usage decreased to 867.89 Mb (58.7% reduction)

4. Basic Data Stats

Before attempting to solve the problem, it's very important to have a good understanding of data.

The goal of this section is to:

  • Get the dimensions of data
  • Get the summary of data
  • Get various statistics of data

Shape of dataframe

In [13]:
# Dimensions of identity dataset
print(df_id.shape)
(144233, 41)

The dataset has 144233 rows and 41 columns

In [14]:
# Dimensions of transaction dataset
print(df_tran.shape)
(590540, 394)

The dataset has 590540 rows and 394 columns

Check how many transactions has ID info

In [15]:
# How many had ID info?
df_tran.TransactionID.isin(df_id.TransactionID).sum()
Out[15]:
144233

Summary of dataframe

In [16]:
df_id.head()
Out[16]:
TransactionID id_01 id_02 id_03 id_04 id_05 id_06 id_07 id_08 id_09 id_10 id_11 id_12 id_13 id_14 id_15 id_16 id_17 id_18 id_19 id_20 id_21 id_22 id_23 id_24 id_25 id_26 id_27 id_28 id_29 id_30 id_31 id_32 id_33 id_34 id_35 id_36 id_37 id_38 DeviceType DeviceInfo
0 2987004 0.0 70787.0 NaN NaN NaN NaN NaN NaN NaN NaN 100.0 NotFound NaN -480.0 New NotFound 166.0 NaN 542.0 144.0 NaN NaN NaN NaN NaN NaN NaN New NotFound Android 7.0 samsung browser 6.2 32.0 2220x1080 match_status:2 T F T T mobile SAMSUNG SM-G892A Build/NRD90M
1 2987008 -5.0 98945.0 NaN NaN 0.0 -5.0 NaN NaN NaN NaN 100.0 NotFound 49.0 -300.0 New NotFound 166.0 NaN 621.0 500.0 NaN NaN NaN NaN NaN NaN NaN New NotFound iOS 11.1.2 mobile safari 11.0 32.0 1334x750 match_status:1 T F F T mobile iOS Device
2 2987010 -5.0 191631.0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 100.0 NotFound 52.0 NaN Found Found 121.0 NaN 410.0 142.0 NaN NaN NaN NaN NaN NaN NaN Found Found NaN chrome 62.0 NaN NaN NaN F F T T desktop Windows
3 2987011 -5.0 221832.0 NaN NaN 0.0 -6.0 NaN NaN NaN NaN 100.0 NotFound 52.0 NaN New NotFound 225.0 NaN 176.0 507.0 NaN NaN NaN NaN NaN NaN NaN New NotFound NaN chrome 62.0 NaN NaN NaN F F T T desktop NaN
4 2987016 0.0 7460.0 0.0 0.0 1.0 0.0 NaN NaN 0.0 0.0 100.0 NotFound NaN -300.0 Found Found 166.0 15.0 529.0 575.0 NaN NaN NaN NaN NaN NaN NaN Found Found Mac OS X 10_11_6 chrome 62.0 24.0 1280x800 match_status:2 T F T T desktop MacOS
In [17]:
from pandas_summary import DataFrameSummary
df_id_summary = DataFrameSummary(df_id)
df_id_summary.summary()
Out[17]:
TransactionID id_01 id_02 id_03 id_04 id_05 id_06 id_07 id_08 id_09 id_10 id_11 id_12 id_13 id_14 id_15 id_16 id_17 id_18 id_19 id_20 id_21 id_22 id_23 id_24 id_25 id_26 id_27 id_28 id_29 id_30 id_31 id_32 id_33 id_34 id_35 id_36 id_37 id_38 DeviceType DeviceInfo
count 144233 144233 140872 66324 66324 136865 136865 5155 5155 74926 74926 140978 NaN 127320 80044 NaN NaN 139369 45113 139318 139261 5159 5169 NaN 4747 5132 5163 NaN NaN NaN NaN NaN 77586 NaN NaN NaN NaN NaN NaN NaN NaN
mean 3.23633e+06 NaN 174717 0 -0 NaN NaN inf -inf 0 -0 NaN NaN NaN NaN NaN NaN NaN inf NaN NaN inf inf NaN 12.7891 inf inf NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
std 178850 0 159652 0 0 0 0 11.3828 26.0781 0 0 0 NaN 0 NaN NaN NaN 0 1.56152 NaN NaN inf 6.89844 NaN 2.37109 97.4375 32.0938 NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN
min 2.987e+06 -100 1 -13 -28 -72 -100 -46 -100 -36 -100 90 NaN 10 -660 NaN NaN 100 10 100 100 100 10 NaN 11 100 100 NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN
25% 3.07714e+06 -10 67992 0 0 0 -6 5 -48 0 0 100 NaN 49 -360 NaN NaN 166 13 266 256 252 14 NaN 11 321 119 NaN NaN NaN NaN NaN 24 NaN NaN NaN NaN NaN NaN NaN NaN
50% 3.19882e+06 -5 125800 0 0 0 0 14 -34 0 0 100 NaN 52 -300 NaN NaN 166 15 341 472 252 14 NaN 11 321 149 NaN NaN NaN NaN NaN 24 NaN NaN NaN NaN NaN NaN NaN NaN
75% 3.39292e+06 -5 228749 0 0 1 0 22 -23 0 0 100 NaN 52 -300 NaN NaN 225 15 427 533 486.5 14 NaN 15 371 169 NaN NaN NaN NaN NaN 32 NaN NaN NaN NaN NaN NaN NaN NaN
max 3.57753e+06 0 999595 10 0 52 0 61 0 25 0 100 NaN 64 720 NaN NaN 229 29 671 661 854 44 NaN 26 548 216 NaN NaN NaN NaN NaN 32 NaN NaN NaN NaN NaN NaN NaN NaN
counts 144233 144233 140872 66324 66324 136865 136865 5155 5155 74926 74926 140978 144233 127320 80044 140985 129340 139369 45113 139318 139261 5159 5169 5169 4747 5132 5163 5169 140978 140978 77565 140282 77586 73289 77805 140985 140985 140985 140985 140810 118666
uniques 144233 77 115655 24 15 93 101 84 94 46 62 146 2 54 25 3 2 104 18 522 394 490 25 3 12 341 95 2 2 2 75 130 4 260 4 2 2 2 2 2 1786
missing 0 0 3361 77909 77909 7368 7368 139078 139078 69307 69307 3255 0 16913 64189 3248 14893 4864 99120 4915 4972 139074 139064 139064 139486 139101 139070 139064 3255 3255 66668 3951 66647 70944 66428 3248 3248 3248 3248 3423 25567
missing_perc 0% 0% 2.33% 54.02% 54.02% 5.11% 5.11% 96.43% 96.43% 48.05% 48.05% 2.26% 0% 11.73% 44.50% 2.25% 10.33% 3.37% 68.72% 3.41% 3.45% 96.42% 96.42% 96.42% 96.71% 96.44% 96.42% 96.42% 2.26% 2.26% 46.22% 2.74% 46.21% 49.19% 46.06% 2.25% 2.25% 2.25% 2.25% 2.37% 17.73%
types numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric bool numeric numeric categorical bool numeric numeric numeric numeric numeric numeric categorical numeric numeric numeric bool bool bool categorical categorical numeric categorical categorical bool bool bool bool bool categorical

By looking at the summary of datasets, it's clear there is a lot of missing values in the dataset.

Let's get missing value stats and various other stats of columns in dataframe.

Stats on Transaction Dataset

In [18]:
from pandas_summary import DataFrameSummary
df_tran_summary = DataFrameSummary(df_tran)
df_tran_summary.summary()
Out[18]:
TransactionID isFraud TransactionDT TransactionAmt ProductCD card1 card2 card3 card4 card5 card6 addr1 addr2 dist1 dist2 P_emaildomain R_emaildomain C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15 M1 M2 M3 M4 M5 M6 M7 M8 M9 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71 V72 V73 V74 V75 V76 V77 V78 V79 V80 V81 V82 V83 V84 V85 V86 V87 V88 V89 V90 V91 V92 V93 V94 V95 V96 V97 V98 V99 V100 V101 V102 V103 V104 V105 V106 V107 V108 V109 V110 V111 V112 V113 V114 V115 V116 V117 V118 V119 V120 V121 V122 V123 V124 V125 V126 V127 V128 V129 V130 V131 V132 V133 V134 V135 V136 V137 V138 V139 V140 V141 V142 V143 V144 V145 V146 V147 V148 V149 V150 V151 V152 V153 V154 V155 V156 V157 V158 V159 V160 V161 V162 V163 V164 V165 V166 V167 V168 V169 V170 V171 V172 V173 V174 V175 V176 V177 V178 V179 V180 V181 V182 V183 V184 V185 V186 V187 V188 V189 V190 V191 V192 V193 V194 V195 V196 V197 V198 V199 V200 V201 V202 V203 V204 V205 V206 V207 V208 V209 V210 V211 V212 V213 V214 V215 V216 V217 V218 V219 V220 V221 V222 V223 V224 V225 V226 V227 V228 V229 V230 V231 V232 V233 V234 V235 V236 V237 V238 V239 V240 V241 V242 V243 V244 V245 V246 V247 V248 V249 V250 V251 V252 V253 V254 V255 V256 V257 V258 V259 V260 V261 V262 V263 V264 V265 V266 V267 V268 V269 V270 V271 V272 V273 V274 V275 V276 V277 V278 V279 V280 V281 V282 V283 V284 V285 V286 V287 V288 V289 V290 V291 V292 V293 V294 V295 V296 V297 V298 V299 V300 V301 V302 V303 V304 V305 V306 V307 V308 V309 V310 V311 V312 V313 V314 V315 V316 V317 V318 V319 V320 V321 V322 V323 V324 V325 V326 V327 V328 V329 V330 V331 V332 V333 V334 V335 V336 V337 V338 V339
count 590540 590540 590540 590540 NaN 590540 581607 588975 NaN 586281 NaN 524834 524834 238269 37627 NaN NaN 590540 590540 590540 590540 590540 590540 590540 590540 590540 590540 590540 590540 590540 590540 589271 309743 327662 421618 280699 73187 38917 74926 74926 514518 311253 64717 61952 62187 501427 NaN NaN NaN NaN NaN NaN NaN NaN NaN 311253 311253 311253 311253 311253 311253 311253 311253 311253 311253 311253 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 81945 81945 81945 81945 81945 81951 81951 81951 81945 81945 81945 81945 81951 81951 81951 81945 81945 81945 81945 81945 81945 81951 81951 81945 81945 81945 81951 81951 81951 139631 139631 139819 139819 139819 139631 139631 139819 139819 139631 139631 139631 139631 139819 139631 139631 139631 139819 139819 139631 139631 139819 139819 139631 139631 139631 139631 139819 139819 139631 139819 139819 139631 139819 139819 139631 139631 139631 139631 139631 139631 139819 139819 139819 139631 139631 139631 139631 139631 139631 130430 130430 130430 141416 141416 141416 130430 130430 130430 130430 141416 130430 130430 130430 130430 130430 130430 141416 130430 130430 130430 141416 141416 130430 130430 130430 130430 130430 141416 130430 130430 130430 130430 141416 141416 130430 130430 130430 141416 141416 130430 130430 141416 130430 130430 130430 130430 130430 130430 130430 130430 130430 130430 141416 141416 141416 130430 130430 130430 130430 130430 130430 590528 590528 589271 589271 589271 590528 590528 590528 590528 589271 589271 590528 590528 590528 590528 590528 590528 589271 590528 590528 590528 589271 589271 590528 590528 590528 590528 590528 590528 590528 590528 590528 590528 590528 589271 589271 589271 590528 590528 590528 590528 590528 590528 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351
mean 3.28227e+06 0.03499 7.37231e+06 NaN NaN 9898.73 NaN NaN NaN NaN NaN NaN NaN NaN inf NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN inf NaN 0 NaN NaN inf inf inf NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 129.979 336.612 204.094 NaN NaN NaN 103.513 204.889 145.972 17.2501 38.8212 26.3651 0 NaN NaN 0 0 NaN NaN NaN 0 0 0 0 NaN NaN NaN 0 0 0 0 NaN NaN NaN 47453.2 NaN NaN NaN 877.889 2239.91 359.469 NaN NaN 0 NaN NaN 0 0 0 0 NaN NaN NaN NaN NaN 0 NaN NaN 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 444.147 1078.33 686.957 NaN NaN NaN NaN NaN NaN 385.137 765.988 536.303 38.4375 133.208 71.1071 NaN NaN NaN 0 NaN NaN 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 117.391 201.658 153.521 NaN NaN NaN NaN NaN NaN NaN NaN 107.152 NaN 31.7973 51.9566 42.3282 NaN NaN 0 NaN NaN 0 NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN 0 0 NaN NaN NaN NaN 139.749 408.682 230.413 NaN NaN NaN NaN NaN NaN NaN 109.819 247.607 162.153 18.3725 42.0731 28.3266 NaN NaN NaN 0 NaN 0 0 NaN 0 721.742 1375.78 1014.62 NaN NaN NaN 55.3524 151.161 100.701
std 170474 0.183755 4.61722e+06 NaN NaN 4901.17 NaN 0 NaN 0 NaN NaN 0 NaN inf NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN inf NaN 0 NaN NaN inf inf inf NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN 0 0 0 NaN NaN NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2346.95 4238.67 3010.26 NaN NaN NaN 2266.11 3796.32 2772.99 293.848 451.808 348.333 0 0 0 0 0 NaN 0 NaN 0 0 0 0 NaN 0 0 0 0 0 0 0 0 NaN 142076 NaN NaN NaN 6049.17 8223.26 1244.46 NaN NaN 0 0 0 0 0 0 0 0 NaN NaN NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4683.83 9105.61 6048.98 NaN NaN NaN NaN NaN NaN 4541.84 7496.12 5471.66 571.834 1040.45 680.268 NaN NaN NaN 0 NaN NaN 0 0 0 0 NaN 0 0 0 NaN NaN NaN 0 0 0 0 0 0 0 0 0 0 0 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN 0 0 0 1294.85 2284.83 1605.51 NaN NaN NaN NaN NaN NaN NaN NaN 1258.73 NaN 615.66 732.145 660.612 NaN NaN 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN 0 0 0 0 0 0 0 0 0 0 2348.85 4391.99 3021.92 NaN NaN NaN NaN NaN NaN NaN 2270.03 3980.04 2793.34 332.305 473.499 382.053 NaN NaN NaN 0 0 0 0 0 0 6217.22 11169.3 7955.74 NaN NaN NaN 668.487 1095.03 814.947
min 2.987e+06 0 86400 0.250977 NaN 1000 100 100 NaN 100 NaN 100 10 0 0 NaN NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -122 0 -83 0 0 0 0 -53 -83 0 -193 -83 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
25% 3.13463e+06 0 3.02706e+06 43.3125 NaN 6019 214 150 NaN 166 NaN 204 87 3 7 NaN NaN 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 26 1 0 1 0 0 0.958496 0.208374 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
50% 3.28227e+06 0 7.30653e+06 68.75 NaN 9678 361 150 NaN 226 NaN 299 87 8 37 NaN NaN 1 1 0 0 0 1 0 0 1 0 1 0 3 1 3 97 8 26 10 0 0 37.875 0.666504 15 43 0 0 0 52 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
75% 3.4299e+06 0 1.12466e+07 125 NaN 14184 512 150 NaN 226 NaN 330 87 24 206 NaN NaN 3 3 0 0 1 2 0 0 2 0 2 0 12 2 122 276 27 253 32 40 17 188 0.833496 197 274 13 0 2 314 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 0 1 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 0 0 1 1 1 0 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 107.95 0 0 59 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 30.9244 20 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 33.5935 20.8975 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 151.381 35.97 0 107.938 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 25 0 0 0 0 0 0 0
max 3.57754e+06 1 1.58111e+07 31936 NaN 18396 600 231 NaN 237 NaN 540 102 10288 11624 NaN NaN 4684 5692 26 2252 349 2252 2256 3332 210 3256 3188 3188 2918 1429 640 640 819 869 819 873 843 1708 0.958496 876 670 648 847 878 879 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 8 9 6 6 9 9 8 8 4 5 3 6 1 7 15 15 15 7 15 5 8 13 13 7 13 4 4 5 9 7 15 7 13 3 5 54 54 15 24 1 8 8 48 48 6 12 5 5 5 6 12 5 6 17 51 6 10 16 16 6 10 7 7 1 7 8 2 5 6 6 10 7 8 4 6 30 31 7 19 19 7 7 7 7 30 30 1 2 5 6 7 7 2 880 1410 976 12 88 28 869 1285 928 15 99 55 1 7 7 7 9 9 9 6 6 6 3 3 3 3 3 3 13 13 13 160000 160000 160000 55136 55136 55136 93736 133915 98476 90750 90750 90750 22 33 33 5 9 869 62 297 24 26 20 20 3388 57 69 18 18 24 24 24 24 55136 641511 3300 3300 3300 93736 98476 104060 872 964 19 48 61 31 7 8 14 48 861 1235 920 83 24 83 41 16 31 38 218 30 30 42 21 44 37 7 16 38 14 21 45 45 55 104060 139777 104060 55136 55136 55136 3300 8048 3300 92888 129006 97628 104060 104060 104060 303 400 378 25 384 384 16 144 51 242 360 54 176 65 293 337 332 121 23 45 39 23 23 7 5 20 57 22 262 45 18 36 22 18 18 24 163 60 87 87 48 66 285 8 49 20 153600 153600 153600 55136 55136 55136 55136 4000 4000 4000 51200 66000 51200 104060 104060 104060 880 975 22 32 68 12 95 8 31 10 12 67 1055 323 869 1286 928 93 12 93 49 11 13 16 20 16 2 108800 145765 108800 55136 55136 55136 55136 4816 7520 4816 93736 134021 98476 104060 104060 104060 880 1411 976 12 44 18 15 99 55 160000 160000 160000 55136 55136 55136 104060 104060 104060
counts 590540 590540 590540 590540 590540 590540 581607 588975 588963 586281 588969 524834 524834 238269 37627 496084 137291 590540 590540 590540 590540 590540 590540 590540 590540 590540 590540 590540 590540 590540 590540 589271 309743 327662 421618 280699 73187 38917 74926 74926 514518 311253 64717 61952 62187 501427 319440 319440 319440 309096 240058 421180 244275 244288 244288 311253 311253 311253 311253 311253 311253 311253 311253 311253 311253 311253 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 514467 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 421571 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 513444 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 501376 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 590226 81945 81945 81945 81945 81945 81951 81951 81951 81945 81945 81945 81945 81951 81951 81951 81945 81945 81945 81945 81945 81945 81951 81951 81945 81945 81945 81951 81951 81951 139631 139631 139819 139819 139819 139631 139631 139819 139819 139631 139631 139631 139631 139819 139631 139631 139631 139819 139819 139631 139631 139819 139819 139631 139631 139631 139631 139819 139819 139631 139819 139819 139631 139819 139819 139631 139631 139631 139631 139631 139631 139819 139819 139819 139631 139631 139631 139631 139631 139631 130430 130430 130430 141416 141416 141416 130430 130430 130430 130430 141416 130430 130430 130430 130430 130430 130430 141416 130430 130430 130430 141416 141416 130430 130430 130430 130430 130430 141416 130430 130430 130430 130430 141416 141416 130430 130430 130430 141416 141416 130430 130430 141416 130430 130430 130430 130430 130430 130430 130430 130430 130430 130430 141416 141416 141416 130430 130430 130430 130430 130430 130430 590528 590528 589271 589271 589271 590528 590528 590528 590528 589271 589271 590528 590528 590528 590528 590528 590528 589271 590528 590528 590528 589271 589271 590528 590528 590528 590528 590528 590528 590528 590528 590528 590528 590528 589271 589271 589271 590528 590528 590528 590528 590528 590528 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351 82351
uniques 590540 2 573349 8195 5 13553 500 114 4 119 4 332 74 2412 1699 59 60 1495 1167 27 1223 319 1291 1069 1130 205 1122 1343 1066 1464 1108 641 641 649 808 688 829 597 5367 24 818 676 635 577 802 859 2 2 2 3 2 2 2 2 2 2 9 10 7 7 10 10 9 9 5 6 4 7 2 8 15 16 16 8 15 6 9 14 14 7 13 4 4 6 8 8 15 7 13 4 6 55 55 16 18 2 9 9 49 49 7 9 6 6 6 7 9 6 7 18 52 7 11 17 17 7 11 8 8 2 8 9 3 6 7 7 11 8 9 5 7 31 32 8 20 20 8 8 8 8 31 31 2 3 6 7 8 8 3 881 1410 976 13 89 29 870 1285 928 16 100 56 2 8 8 8 10 10 10 7 7 7 4 4 4 4 4 4 14 14 14 10299 24414 14507 1608 5511 3097 6560 9949 8178 3724 4852 4252 23 34 34 6 10 870 63 260 25 27 21 21 1344 56 39 19 19 25 25 25 25 2492 9621 79 185 106 1978 2547 987 873 965 20 49 62 32 8 9 15 49 862 1236 921 84 25 84 42 17 32 39 215 31 31 43 22 45 38 8 17 39 15 22 46 46 56 10970 14951 12858 1953 1581 2705 2093 2674 2262 7624 8868 8317 2282 2747 2532 304 401 379 26 77 76 17 79 35 81 50 55 91 66 294 338 333 122 24 46 40 24 24 6 5 21 43 23 58 46 19 23 23 19 19 25 66 45 46 48 49 67 68 9 41 21 10422 13358 11757 1871 2884 2286 151 1972 2286 2082 4689 8315 4965 2263 2540 2398 881 975 23 33 62 13 96 9 32 11 13 58 219 173 870 1286 928 94 13 94 50 12 14 17 21 17 2 16210 37367 23064 3239 7759 2526 5143 3915 5974 4540 9814 15184 12309 4799 6439 5560 881 1411 976 13 45 19 16 100 56 1758 2453 1971 143 669 355 254 380 334
missing 0 0 0 0 0 0 8933 1565 1577 4259 1571 65706 65706 352271 552913 94456 453249 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1269 280797 262878 168922 309841 517353 551623 515614 515614 76022 279287 525823 528588 528353 89113 271100 271100 271100 281444 350482 169360 346265 346252 346252 279287 279287 279287 279287 279287 279287 279287 279287 279287 279287 279287 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 76073 168969 168969 168969 168969 168969 168969 168969 168969 168969 168969 168969 168969 168969 168969 168969 168969 168969 168969 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 77096 89164 89164 89164 89164 89164 89164 89164 89164 89164 89164 89164 89164 89164 89164 89164 89164 89164 89164 89164 89164 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 314 508595 508595 508595 508595 508595 508589 508589 508589 508595 508595 508595 508595 508589 508589 508589 508595 508595 508595 508595 508595 508595 508589 508589 508595 508595 508595 508589 508589 508589 450909 450909 450721 450721 450721 450909 450909 450721 450721 450909 450909 450909 450909 450721 450909 450909 450909 450721 450721 450909 450909 450721 450721 450909 450909 450909 450909 450721 450721 450909 450721 450721 450909 450721 450721 450909 450909 450909 450909 450909 450909 450721 450721 450721 450909 450909 450909 450909 450909 450909 460110 460110 460110 449124 449124 449124 460110 460110 460110 460110 449124 460110 460110 460110 460110 460110 460110 449124 460110 460110 460110 449124 449124 460110 460110 460110 460110 460110 449124 460110 460110 460110 460110 449124 449124 460110 460110 460110 449124 449124 460110 460110 449124 460110 460110 460110 460110 460110 460110 460110 460110 460110 460110 449124 449124 449124 460110 460110 460110 460110 460110 460110 12 12 1269 1269 1269 12 12 12 12 1269 1269 12 12 12 12 12 12 1269 12 12 12 1269 1269 12 12 12 12 12 12 12 12 12 12 12 1269 1269 1269 12 12 12 12 12 12 508189 508189 508189 508189 508189 508189 508189 508189 508189 508189 508189 508189 508189 508189 508189 508189 508189 508189
missing_perc 0% 0% 0% 0% 0% 0% 1.51% 0.27% 0.27% 0.72% 0.27% 11.13% 11.13% 59.65% 93.63% 15.99% 76.75% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0.21% 47.55% 44.51% 28.60% 52.47% 87.61% 93.41% 87.31% 87.31% 12.87% 47.29% 89.04% 89.51% 89.47% 15.09% 45.91% 45.91% 45.91% 47.66% 59.35% 28.68% 58.64% 58.63% 58.63% 47.29% 47.29% 47.29% 47.29% 47.29% 47.29% 47.29% 47.29% 47.29% 47.29% 47.29% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 12.88% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 28.61% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 13.06% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 15.10% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 0.05% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 86.12% 76.36% 76.36% 76.32% 76.32% 76.32% 76.36% 76.36% 76.32% 76.32% 76.36% 76.36% 76.36% 76.36% 76.32% 76.36% 76.36% 76.36% 76.32% 76.32% 76.36% 76.36% 76.32% 76.32% 76.36% 76.36% 76.36% 76.36% 76.32% 76.32% 76.36% 76.32% 76.32% 76.36% 76.32% 76.32% 76.36% 76.36% 76.36% 76.36% 76.36% 76.36% 76.32% 76.32% 76.32% 76.36% 76.36% 76.36% 76.36% 76.36% 76.36% 77.91% 77.91% 77.91% 76.05% 76.05% 76.05% 77.91% 77.91% 77.91% 77.91% 76.05% 77.91% 77.91% 77.91% 77.91% 77.91% 77.91% 76.05% 77.91% 77.91% 77.91% 76.05% 76.05% 77.91% 77.91% 77.91% 77.91% 77.91% 76.05% 77.91% 77.91% 77.91% 77.91% 76.05% 76.05% 77.91% 77.91% 77.91% 76.05% 76.05% 77.91% 77.91% 76.05% 77.91% 77.91% 77.91% 77.91% 77.91% 77.91% 77.91% 77.91% 77.91% 77.91% 76.05% 76.05% 76.05% 77.91% 77.91% 77.91% 77.91% 77.91% 77.91% 0.00% 0.00% 0.21% 0.21% 0.21% 0.00% 0.00% 0.00% 0.00% 0.21% 0.21% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.21% 0.00% 0.00% 0.00% 0.21% 0.21% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.21% 0.21% 0.21% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05% 86.05%
types numeric bool numeric numeric categorical numeric numeric numeric categorical numeric categorical numeric numeric numeric numeric categorical categorical numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric bool bool bool categorical bool bool bool bool bool bool numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric bool numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric bool numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric bool numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric bool numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric bool numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric bool numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric

Check class imbalance

In [19]:
df_tran.loc[:, 'isFraud'].value_counts()
Out[19]:
0    569877
1     20663
Name: isFraud, dtype: int64

Inferences:

Lot of interesting things can be observed here:

  • Rows in identity dataset are less than transaction dataset, that means only a subset of transactions in transactions dataset has identity data
  • Both datasets have the common and unique key as TransactionID, both can be joined at this unique key
  • id_24, id_25, dist2, D7 and many more columns have 90%+ missing values, which means that these columns are probably useless so need to drop it for now
  • Columns from V1 to V339 in transaction dataset are numeric whereas columns from id_01 to id_39 are of mixed datatype
  • TransactionDT column is a timedelta from a given reference datetime (not an actual timestamp). But reference datetime is not known, so need to assume it and convert it to date format
  • Target class is imbalanced. So no need to drop the columns where one category contains the majority of rows

5. Data Preprocessing for EDA

The goal of this section is to:

  • Merge two datasets
  • Drop the columns based on the inferences from previous section
  • Create date features from transaction date

Let's start with the first task to merge datasets to form one.

Merge the datasets

In [20]:
# Merge transaction dataset and identity dataset 
df = df_tran.merge(df_id, how='left', left_index=True, right_index=True, on='TransactionID')

del df_tran, df_id

gc.collect()
Out[20]:
0

Get dimensions of training dataset

In [21]:
# Dimentions of data
df.shape
Out[21]:
(590540, 434)

Since left join was performed on transaction dataset, number of rows are same as transaction dataset.

Add missing flag

In [22]:
# Add flag column for missing values
for col in df.columns:
    df[col+"_missing_flag"] = df[col].isnull()
    
df.head()
Out[22]:
TransactionID isFraud TransactionDT TransactionAmt ProductCD card1 card2 card3 card4 card5 card6 addr1 addr2 dist1 dist2 P_emaildomain R_emaildomain C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15 M1 M2 M3 M4 M5 M6 M7 M8 M9 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71 V72 V73 V74 V75 V76 V77 V78 V79 V80 V81 V82 V83 V84 V85 V86 V87 V88 V89 V90 V91 V92 V93 V94 V95 V96 V97 V98 V99 V100 V101 V102 V103 V104 V105 V106 V107 V108 V109 V110 V111 V112 V113 V114 V115 V116 V117 V118 V119 V120 V121 V122 V123 V124 V125 V126 V127 V128 V129 V130 V131 V132 V133 V134 V135 V136 V137 V138 V139 V140 V141 V142 V143 V144 V145 V146 V147 V148 V149 V150 V151 V152 V153 V154 V155 V156 V157 V158 V159 V160 V161 V162 V163 V164 V165 V166 V167 V168 V169 V170 V171 V172 V173 V174 V175 V176 V177 V178 V179 V180 V181 V182 V183 V184 V185 V186 V187 V188 V189 V190 V191 V192 V193 V194 V195 ... V130_missing_flag V131_missing_flag V132_missing_flag V133_missing_flag V134_missing_flag V135_missing_flag V136_missing_flag V137_missing_flag V138_missing_flag V139_missing_flag V140_missing_flag V141_missing_flag V142_missing_flag V143_missing_flag V144_missing_flag V145_missing_flag V146_missing_flag V147_missing_flag V148_missing_flag V149_missing_flag V150_missing_flag V151_missing_flag V152_missing_flag V153_missing_flag V154_missing_flag V155_missing_flag V156_missing_flag V157_missing_flag V158_missing_flag V159_missing_flag V160_missing_flag V161_missing_flag V162_missing_flag V163_missing_flag V164_missing_flag V165_missing_flag V166_missing_flag V167_missing_flag V168_missing_flag V169_missing_flag V170_missing_flag V171_missing_flag V172_missing_flag V173_missing_flag V174_missing_flag V175_missing_flag V176_missing_flag V177_missing_flag V178_missing_flag V179_missing_flag V180_missing_flag V181_missing_flag V182_missing_flag V183_missing_flag V184_missing_flag V185_missing_flag V186_missing_flag V187_missing_flag V188_missing_flag V189_missing_flag V190_missing_flag V191_missing_flag V192_missing_flag V193_missing_flag V194_missing_flag V195_missing_flag V196_missing_flag V197_missing_flag V198_missing_flag V199_missing_flag V200_missing_flag V201_missing_flag V202_missing_flag V203_missing_flag V204_missing_flag V205_missing_flag V206_missing_flag V207_missing_flag V208_missing_flag V209_missing_flag V210_missing_flag V211_missing_flag V212_missing_flag V213_missing_flag V214_missing_flag V215_missing_flag V216_missing_flag V217_missing_flag V218_missing_flag V219_missing_flag V220_missing_flag V221_missing_flag V222_missing_flag V223_missing_flag V224_missing_flag V225_missing_flag V226_missing_flag V227_missing_flag V228_missing_flag V229_missing_flag V230_missing_flag V231_missing_flag V232_missing_flag V233_missing_flag V234_missing_flag V235_missing_flag V236_missing_flag V237_missing_flag V238_missing_flag V239_missing_flag V240_missing_flag V241_missing_flag V242_missing_flag V243_missing_flag V244_missing_flag V245_missing_flag V246_missing_flag V247_missing_flag V248_missing_flag V249_missing_flag V250_missing_flag V251_missing_flag V252_missing_flag V253_missing_flag V254_missing_flag V255_missing_flag V256_missing_flag V257_missing_flag V258_missing_flag V259_missing_flag V260_missing_flag V261_missing_flag V262_missing_flag V263_missing_flag V264_missing_flag V265_missing_flag V266_missing_flag V267_missing_flag V268_missing_flag V269_missing_flag V270_missing_flag V271_missing_flag V272_missing_flag V273_missing_flag V274_missing_flag V275_missing_flag V276_missing_flag V277_missing_flag V278_missing_flag V279_missing_flag V280_missing_flag V281_missing_flag V282_missing_flag V283_missing_flag V284_missing_flag V285_missing_flag V286_missing_flag V287_missing_flag V288_missing_flag V289_missing_flag V290_missing_flag V291_missing_flag V292_missing_flag V293_missing_flag V294_missing_flag V295_missing_flag V296_missing_flag V297_missing_flag V298_missing_flag V299_missing_flag V300_missing_flag V301_missing_flag V302_missing_flag V303_missing_flag V304_missing_flag V305_missing_flag V306_missing_flag V307_missing_flag V308_missing_flag V309_missing_flag V310_missing_flag V311_missing_flag V312_missing_flag V313_missing_flag V314_missing_flag V315_missing_flag V316_missing_flag V317_missing_flag V318_missing_flag V319_missing_flag V320_missing_flag V321_missing_flag V322_missing_flag V323_missing_flag V324_missing_flag V325_missing_flag V326_missing_flag V327_missing_flag V328_missing_flag V329_missing_flag V330_missing_flag V331_missing_flag V332_missing_flag V333_missing_flag V334_missing_flag V335_missing_flag V336_missing_flag V337_missing_flag V338_missing_flag V339_missing_flag id_01_missing_flag id_02_missing_flag id_03_missing_flag id_04_missing_flag id_05_missing_flag id_06_missing_flag id_07_missing_flag id_08_missing_flag id_09_missing_flag id_10_missing_flag id_11_missing_flag id_12_missing_flag id_13_missing_flag id_14_missing_flag id_15_missing_flag id_16_missing_flag id_17_missing_flag id_18_missing_flag id_19_missing_flag id_20_missing_flag id_21_missing_flag id_22_missing_flag id_23_missing_flag id_24_missing_flag id_25_missing_flag id_26_missing_flag id_27_missing_flag id_28_missing_flag id_29_missing_flag id_30_missing_flag id_31_missing_flag id_32_missing_flag id_33_missing_flag id_34_missing_flag id_35_missing_flag id_36_missing_flag id_37_missing_flag id_38_missing_flag DeviceType_missing_flag DeviceInfo_missing_flag
0 2987000 0 86400 68.5 W 13926 NaN 150.0 discover 142.0 credit 315.0 87.0 19.0 NaN NaN NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 2.0 0.0 1.0 1.0 14.0 NaN 13.0 NaN NaN NaN NaN NaN NaN 13.0 13.0 NaN NaN NaN 0.0 T T T M2 F T NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 117.0 0.0 0.0 0.0 0.0 0.0 117.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... False False False False False False False False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True True True True True True True False False True False False False False True False False True True True True True True True False False False False False False False False False False False False False
1 2987001 0 86401 29.0 W 2755 404.0 150.0 mastercard 102.0 credit 325.0 87.0 NaN NaN gmail.com NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN NaN M0 T T NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... False False False False False False False False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True False False True True True True False False False False False False False True False False True True True True True True True False False False False False False False False False False False False False
2 2987002 0 86469 59.0 W 4663 490.0 150.0 visa 166.0 debit 330.0 87.0 287.0 NaN outlook.com NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN NaN 0.0 315.0 NaN NaN NaN 315.0 T T T M0 F F F F F 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... False False False False False False False False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False False False False False True True False False False False False True False False False True False False True True True True True True True False False True False True True True False False False False False False
3 2987003 0 86499 50.0 W 18132 567.0 150.0 mastercard 117.0 debit 476.0 87.0 NaN NaN yahoo.com NaN 2.0 5.0 0.0 0.0 0.0 4.0 0.0 0.0 1.0 0.0 1.0 0.0 25.0 1.0 112.0 112.0 0.0 94.0 0.0 NaN NaN NaN NaN 84.0 NaN NaN NaN NaN 111.0 NaN NaN NaN M0 T F NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 48.0 28.0 0.0 10.0 4.0 1.0 38.0 24.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 50.0 1758.0 925.0 0.0 354.0 135.0 50.0 1404.0 790.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... False False False False False False False False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True False False True True True True False False False True False False False True False False True True True True True True True False False True False True True True False False False False False True
4 2987004 0 86506 50.0 H 4497 514.0 150.0 mastercard 102.0 credit 420.0 87.0 NaN NaN gmail.com NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 18.0 140.0 0.0 0.0 0.0 0.0 1803.0 49.0 64.0 0.0 0.0 0.0 0.0 0.0 0.0 15560.0 169690.796875 0.0 0.0 0.0 515.0 5155.0 2840.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True False False False False True False False False False False False False True True True True True True True False False False False False False False False False False False False False

5 rows × 868 columns

Clean Data

Let's drop the columns which may not be useful for our analysis

Create a missing value flag column for the columns we are dropping which have more than 90% missing values, there might be some specific pattern associated with missing values and transaction being fraud

In [25]:
# Drop the columns where one category contains more than 90% values
drop_cols = []

for col in df.columns:
    missing_share = df[col].isnull().sum()/df.shape[0]
    if missing_share > 0.9:
        drop_cols.append(col)
        print(col)
        # df[col + "_missing_flag"] = df[col].isnull()
    
good_cols = [col for col in df.columns if col not in drop_cols]    
dist2
D7
id_07
id_08
id_18
id_21
id_22
id_23
id_24
id_25
id_26
id_27

Remove the columns which doesn't having any variance

In [26]:
# Drop the columns which have only one unique value
drop_cols = []
for col in good_cols:
    unique_value = df[col].nunique()
    if unique_value == 1:
        drop_cols.append(col)
        print(col)
good_cols = [col for col in good_cols if col not in drop_cols]
TransactionID_missing_flag
isFraud_missing_flag
TransactionDT_missing_flag
TransactionAmt_missing_flag
ProductCD_missing_flag
card1_missing_flag
C1_missing_flag
C2_missing_flag
C3_missing_flag
C4_missing_flag
C5_missing_flag
C6_missing_flag
C7_missing_flag
C8_missing_flag
C9_missing_flag
C10_missing_flag
C11_missing_flag
C12_missing_flag
C13_missing_flag
C14_missing_flag

Filter the dataset with only good columns

In [27]:
# Filter the data for relevant columns only
df = df[good_cols]

Get dimentions of training dataset

In [28]:
# Dimentions of data
df.shape
Out[28]:
(590540, 836)

Create date features

Let's create date features from TransactionDT features

In [29]:
# Date features
START_DATE         = '2017-12-01'
startdate          = datetime.datetime.strptime(START_DATE, "%Y-%m-%d")
df["Date"]         = df['TransactionDT'].apply(lambda x: (startdate + datetime.timedelta(seconds=x)))

df['_Weekdays']    = df['Date'].dt.dayofweek
df['_Hours']       = df['Date'].dt.hour
df['_Days']        = df['Date'].dt.day
In [30]:
df = reduce_mem_usage(df)
Mem. usage decreased to 1449.38 Mb (0.8% reduction)

6. Exploratory Data Analysis

Exploratory data analysis is an approach to analyze or investigate data sets to find out patterns and see if any of the variables can be useful to explain / predict the y variables.

Visual methods are often used to summarise the data. Primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing tasks.

The goal of this section is to:

  • Check if the target variable is balanced or is there a need to balance the target variable
  • Analyze the transaction amount
  • Get insights or relationships from the data which would be useful from business perspective.

Check distribution of target variable

In [31]:
# Get count of target class
df['isFraud'].value_counts()
Out[31]:
0    569877
1     20663
Name: isFraud, dtype: int64

Let's check the distribution of target class using a bar plot and check the proportion of transactions amounts being fraud

In [32]:
# Draw a countplot to check the distribution of target variable
df['TransactionAmt'] = df['TransactionAmt'].astype(float)
total = len(df)
total_amt = df.groupby(['isFraud'])['TransactionAmt'].sum().sum()
plt.figure(figsize=(16,6))

plt.subplot(121)
g = sns.countplot(x='isFraud', data=df )
g.set_title("Fraud Transactions Distribution \n 0: No Fraud | 1: Fraud", fontsize=18)
g.set_xlabel("Is fraud?", fontsize=18)
g.set_ylabel('Count', fontsize=18)
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 

perc_amt = (df.groupby(['isFraud'])['TransactionAmt'].sum())
perc_amt = perc_amt.reset_index()

plt.subplot(122)
g1 = sns.barplot(x='isFraud', y='TransactionAmt',  dodge=True, data=perc_amt)
g1.set_title("% Total Amount in Transaction Amt \n 0: No Fraud | 1: Fraud", fontsize=18)
g1.set_xlabel("Is fraud?", fontsize=18)
g1.set_ylabel('Total Transaction Amount Scalar', fontsize=18)
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/total_amt * 100),
            ha="center", fontsize=15) 
    
plt.show()
In [37]:
# Average transaction amount by Y
df.groupby('isFraud')['TransactionAmt'].mean()
Out[37]:
isFraud
0    134.511857
1    149.244353
Name: TransactionAmt, dtype: float64

Inferences:

  • The target variable is imbalanced. 3.5% transactions are Fraud
  • Around same % of transaction amounts are fraud

Let's explore the Transaction amount further

Check distribution of Transaction Amount

In [29]:
# Distribution plot of Transaction Amount
plt.figure(figsize=(16,12))

sns.distplot(df['TransactionAmt'])
plt.title("Transaction Amount Distribution",fontsize=18)
plt.ylabel("Probability")
Out[29]:
Text(0, 0.5, 'Probability')

There are certain transactions which are of very high amount, let's remove those to check the distribution

In [30]:
# Distribution plot of Transaction Amount less than 1000
plt.figure(figsize=(16,12))

plt.suptitle('Transaction Values Distribution', fontsize=22)
sns.distplot(df[df['TransactionAmt'] <= 1000]['TransactionAmt'])
plt.title("Transaction Amount Distribuition <= 1000", fontsize=18)
plt.xlabel("Transaction Amount", fontsize=15)
plt.ylabel("Probability", fontsize=15)

plt.show()

Most transactions lie in < $200 range

Transaction amount is right skewed.

Let's look at the log of transaction amount

In [31]:
# Distribution plot of Transaction Amount less than 1000
plt.figure(figsize=(16,12))

plt.suptitle('Transaction Values Distribution', fontsize=22)
sns.distplot(np.log(df['TransactionAmt']))
plt.title("Transaction Amount (Log) Distribuition", fontsize=18)
plt.xlabel("Transaction Amount", fontsize=15)
plt.ylabel("Probability", fontsize=15)

plt.show()

Inferences:

  • Transaction Amount is right skewed.
  • Log of transaction amount is almost normally distributed, so use log of transaction amount while building the model

Product Features

  • Distribution of ProductCD
  • Distribution of Frauds by Product
In [32]:
def plot_cat_feat_dist(df, col):
    tmp = pd.crosstab(df[col], df['isFraud'], normalize='index') * 100
    tmp = tmp.reset_index()
    tmp.rename(columns={0:'NoFraud', 1:'Fraud'}, inplace=True)

    plt.figure(figsize=(16,12))
    plt.suptitle(f'{col} Distributions', fontsize=22)

    plt.subplot(221)
    g = sns.countplot(x=col, data=df, order=tmp[col].values)

    g.set_title(f"{col} Distribution", fontsize=16)
    g.set_xlabel(f"{col} Name", fontsize=17)
    g.set_ylabel("Count", fontsize=17)
    for p in g.patches:
        height = p.get_height()
        g.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/total*100),
                ha="center", fontsize=14) 

    plt.subplot(222)
    g1 = sns.countplot(x=col, hue='isFraud', data=df, order=tmp[col].values)
    plt.legend(title='Fraud', loc='best', labels=['No', 'Yes'])
    gt = g1.twinx()
    gt = sns.pointplot(x=col, y='Fraud', data=tmp, color='black', order=tmp[col].values, legend=False)
    gt.set_ylabel("% of Fraud Transactions", fontsize=16)

    g1.set_title(f"{col} Distribution by Target Variable (isFraud) ", fontsize=16)
    g1.set_xlabel(f"{col} Name", fontsize=17)
    g1.set_ylabel("Count", fontsize=17)


    plt.subplots_adjust(hspace = 0.4, top = 0.85)

    plt.show()
In [33]:
plot_cat_feat_dist(df, "ProductCD")
In [40]:
# Average fraud per transaction by ProductCD
df.groupby('ProductCD')['isFraud'].mean()
Out[40]:
ProductCD
C    0.116873
H    0.047662
R    0.037826
S    0.058996
W    0.020399
Name: isFraud, dtype: float64

Inferences:

  • 75% of the transactions are for Product Catergory W
  • 11.6% of the transactions are for Product Category C
  • Fraud Transaction rate is maximum for Product Category C and minimum for Product Category W

Card Features

In [34]:
# Card 4
plot_cat_feat_dist(df, "card4")
In [41]:
# Average fraud per transaction by Card4
df.groupby('card4')['isFraud'].mean()
Out[41]:
card4
american express    0.028698
discover            0.077282
mastercard          0.034331
visa                0.034756
Name: isFraud, dtype: float64

Inferences:

  • 97% of transactions are from Mastercard(32%) and Visa(65%
  • Fraud transaction rate is highest for discover cards(~8%) against ~3.5% of Mastercard and Visa and 2.87% in American Express
In [35]:
# Card 6
plot_cat_feat_dist(df, "card6")
In [42]:
# Average fraud per transaction by Card6
df.groupby('card6')['isFraud'].mean()
Out[42]:
card6
charge card        0.000000
credit             0.066785
debit              0.024263
debit or credit    0.000000
Name: isFraud, dtype: float64

Inferences:

  • Almost all the transactions are from Credit and Debit cards.
  • Debit card transactions are almost 3 times as compared to credit card transactions.
  • Fraud transaction rate is high for Credit cards as compared to Debit cards.

P emaildomain

  • It has multiple domains, let's group them by the respective enterprises
  • Set all values with less than 500 entries as "Others"
In [36]:
df.loc[df['P_emaildomain'].isin(['gmail.com', 'gmail']),'P_emaildomain'] = 'Google'

df.loc[df['P_emaildomain'].isin(['yahoo.com', 'yahoo.com.mx',  'yahoo.co.uk',
                                         'yahoo.co.jp', 'yahoo.de', 'yahoo.fr',
                                         'yahoo.es']), 'P_emaildomain'] = 'Yahoo Mail'
df.loc[df['P_emaildomain'].isin(['hotmail.com','outlook.com','msn.com', 'live.com.mx', 
                                         'hotmail.es','hotmail.co.uk', 'hotmail.de',
                                         'outlook.es', 'live.com', 'live.fr',
                                         'hotmail.fr']), 'P_emaildomain'] = 'Microsoft'
df.loc[df.P_emaildomain.isin(df.P_emaildomain\
                                         .value_counts()[df.P_emaildomain.value_counts() <= 500 ]\
                                         .index), 'P_emaildomain'] = "Others"
df.P_emaildomain.fillna("NoInf", inplace=True)
In [49]:
def plot_cat_with_amt(df, col, lim=2000):
    tmp = pd.crosstab(df[col], df['isFraud'], normalize='index') * 100
    tmp = tmp.reset_index()
    tmp.rename(columns={0:'NoFraud', 1:'Fraud'}, inplace=True)
    
    plt.figure(figsize=(16,14))    
    plt.suptitle(f'{col} Distributions ', fontsize=24)
    
    plt.subplot(211)
    g = sns.countplot( x=col,  data=df, order=list(tmp[col].values))
    gt = g.twinx()
    gt = sns.pointplot(x=col, y='Fraud', data=tmp, order=list(tmp[col].values),
                       color='black', legend=False, )
    gt.set_ylim(0,tmp['Fraud'].max()*1.1)
    gt.set_ylabel("%Fraud Transactions", fontsize=16)
    g.set_title(f"Share of {col} categories and % of Fraud Transactions", fontsize=18)
    g.set_xlabel(f"{col} Category Names", fontsize=16)
    g.set_ylabel("Count", fontsize=17)
    g.set_xticklabels(g.get_xticklabels(),rotation=45)
    sizes = []
    for p in g.patches:
        height = p.get_height()
        sizes.append(height)
        g.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/total*100),
                ha="center",fontsize=12) 
        
    g.set_ylim(0,max(sizes)*1.15)
    
    #########################################################################
    perc_amt = (df.groupby(['isFraud',col])['TransactionAmt'].sum() \
                / df.groupby([col])['TransactionAmt'].sum() * 100).unstack('isFraud')
    perc_amt = perc_amt.reset_index()
    perc_amt.rename(columns={0:'NoFraud', 1:'Fraud'}, inplace=True)
    amt = df.groupby([col])['TransactionAmt'].sum().reset_index()
    perc_amt = perc_amt.fillna(0)
    plt.subplot(212)
    g1 = sns.barplot(x=col, y='TransactionAmt', 
                       data=amt, 
                       order=list(tmp[col].values))
    g1t = g1.twinx()
    g1t = sns.pointplot(x=col, y='Fraud', data=perc_amt, 
                        order=list(tmp[col].values),
                       color='black', legend=False, )
    g1t.set_ylim(0,perc_amt['Fraud'].max()*1.1)
    g1t.set_ylabel("%Fraud Total Amount", fontsize=16)
    g.set_xticklabels(g.get_xticklabels(),rotation=45)
    g1.set_title(f"Transactions amount by {col} categories and % of Fraud Transactions (Amounts)", fontsize=18)
    g1.set_xlabel(f"{col} Category Names", fontsize=16)
    g1.set_ylabel("Transaction Total Amount(U$)", fontsize=16)
    g1.set_xticklabels(g.get_xticklabels(),rotation=45)    
    
    for p in g1.patches:
        height = p.get_height()
        g1.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/total_amt*100),
                ha="center",fontsize=12) 
        
    plt.subplots_adjust(hspace=.4, top = 0.9)
    plt.show()
    
In [38]:
plot_cat_with_amt(df, 'P_emaildomain')
In [44]:
# Average fraud per transaction by Card6
df.groupby('P_emaildomain')['isFraud'].mean()
Out[44]:
P_emaildomain
aim.com             0.126984
anonymous.com       0.023217
aol.com             0.021811
att.net             0.007439
bellsouth.net       0.027763
cableone.net        0.018868
centurylink.net     0.000000
cfl.rr.com          0.000000
charter.net         0.030637
comcast.net         0.031187
cox.net             0.020818
earthlink.net       0.021401
embarqmail.com      0.034615
frontier.com        0.028571
frontiernet.net     0.025641
gmail               0.022177
gmail.com           0.043542
gmx.de              0.000000
hotmail.co.uk       0.000000
hotmail.com         0.052950
hotmail.de          0.000000
hotmail.es          0.065574
hotmail.fr          0.000000
icloud.com          0.031434
juno.com            0.018634
live.com            0.027622
live.com.mx         0.054740
live.fr             0.000000
mac.com             0.032110
mail.com            0.189624
me.com              0.017740
msn.com             0.021994
netzero.com         0.000000
netzero.net         0.005102
optonline.net       0.016815
outlook.com         0.094584
outlook.es          0.130137
prodigy.net.mx      0.004831
protonmail.com      0.407895
ptd.net             0.000000
q.com               0.000000
roadrunner.com      0.009836
rocketmail.com      0.003012
sbcglobal.net       0.004040
sc.rr.com           0.006098
servicios-ta.com    0.000000
suddenlink.net      0.022857
twc.com             0.000000
verizon.net         0.008133
web.de              0.000000
windstream.net      0.000000
yahoo.co.jp         0.000000
yahoo.co.uk         0.000000
yahoo.com           0.022757
yahoo.com.mx        0.010369
yahoo.de            0.000000
yahoo.es            0.014925
yahoo.fr            0.034965
ymail.com           0.020868
Name: isFraud, dtype: float64

Inferences:

  • Majority of transactions are with P_emaildomain as Google, Microsoft and Yahoo Mail
  • There isn't any information about P_emaildomain of around 16% transactions in terms of count and 14.11% in terms of amount
  • Fraud transaction rate for Microsoft is high as compared to Google and Yahoo mail
  • Fraud transaction rate (amount) for Google is high as comapred to Microsoft and Yahoo mail

R-Email Domain

  • It has multiple domains, let's group them by the respective enterprises
  • Set all values with less than 500 entries as "Others"
In [50]:
df.loc[df['R_emaildomain'].isin(['gmail.com', 'gmail']),'R_emaildomain'] = 'Google'

df.loc[df['R_emaildomain'].isin(['yahoo.com', 'yahoo.com.mx',  'yahoo.co.uk',
                                             'yahoo.co.jp', 'yahoo.de', 'yahoo.fr',
                                             'yahoo.es']), 'R_emaildomain'] = 'Yahoo Mail'
df.loc[df['R_emaildomain'].isin(['hotmail.com','outlook.com','msn.com', 'live.com.mx', 
                                             'hotmail.es','hotmail.co.uk', 'hotmail.de',
                                             'outlook.es', 'live.com', 'live.fr',
                                             'hotmail.fr']), 'R_emaildomain'] = 'Microsoft'
df.loc[df.R_emaildomain.isin(df.R_emaildomain\
                                         .value_counts()[df.R_emaildomain.value_counts() <= 300 ]\
                                         .index), 'R_emaildomain'] = "Others"
df.R_emaildomain.fillna("NoInf", inplace=True)
In [40]:
plot_cat_with_amt(df, 'R_emaildomain')
In [45]:
# Average fraud per transaction by Card6
df.groupby('R_emaildomain')['isFraud'].mean()
Out[45]:
R_emaildomain
aim.com             0.027778
anonymous.com       0.029130
aol.com             0.034855
att.net             0.000000
bellsouth.net       0.004739
cableone.net        0.000000
centurylink.net     0.000000
cfl.rr.com          0.000000
charter.net         0.039370
comcast.net         0.011589
cox.net             0.023965
earthlink.net       0.075949
embarqmail.com      0.000000
frontier.com        0.000000
frontiernet.net     0.000000
gmail               0.000000
gmail.com           0.119184
gmx.de              0.000000
hotmail.co.uk       0.000000
hotmail.com         0.077793
hotmail.de          0.000000
hotmail.es          0.068493
hotmail.fr          0.000000
icloud.com          0.128755
juno.com            0.000000
live.com            0.049869
live.com.mx         0.058355
live.fr             0.000000
mac.com             0.009174
mail.com            0.377049
me.com              0.019784
msn.com             0.001174
netzero.com         0.000000
netzero.net         0.222222
optonline.net       0.010695
outlook.com         0.165138
outlook.es          0.131640
prodigy.net.mx      0.004831
protonmail.com      0.951220
ptd.net             0.000000
q.com               0.000000
roadrunner.com      0.000000
rocketmail.com      0.043478
sbcglobal.net       0.001812
sc.rr.com           0.000000
scranton.edu        0.000000
servicios-ta.com    0.000000
suddenlink.net      0.040000
twc.com             0.000000
verizon.net         0.000000
web.de              0.000000
windstream.net      0.000000
yahoo.co.jp         0.000000
yahoo.co.uk         0.000000
yahoo.com           0.051512
yahoo.com.mx        0.010610
yahoo.de            0.000000
yahoo.es            0.035088
yahoo.fr            0.036496
ymail.com           0.038647
Name: isFraud, dtype: float64

Inferences:

  • There isn't any information about R_emaildomain for Majority of transactions (76.75% count , 85.62% amount)
  • Fraud transaction rate for Google is high as compared to Yahoo, anaonymous.com and Microsoft

Days of the Month

Reference date is not known, it has been assumed. So can't say concretely if the day number is correct

In [41]:
plot_cat_with_amt(df, '_Days')

Inferences:

  • The perc of fraud transactions is highest towards the beginning and the end of the month. Might be accelerated at the time of receiving pay-checks.

  • Incidentally, fraud transaction rate is high on the days when number of transactions are less

  • Day 29,30 and 31 are having less transactions, looks like people are cautious with spending in those times.

Days of the week

Reference date is not known, it has been assumed. So can't say concretely if the day number is correct

In [42]:
plot_cat_with_amt(df, '_Weekdays')

Inferences:

  • Surprisingly fraud transaction rate is high on the days when number of transactions and transaction amounts are less. Day 0 and 6
  • Day 0 and 6 have less transactions, these might be weekend days

Hour of the Day

In [43]:
plot_cat_with_amt(df, '_Hours')

Inferences:

  • Transactions start decreasing mid night but the fraud rate starts increasing
  • Transactions from 3 AM to 12 PM needs to monitored very closely

Device Type

In [44]:
plot_cat_with_amt(df, "DeviceType")

Inferences:

  • Device type is known for only 24% of the transactions
  • Due to lack of data points, we can't infer from this analysis

Columns from identity data

In [51]:
for col in ['id_12', 'id_15', 'id_16', 'id_28', 'id_29']:
    df[col] = df[col].fillna('NaN')
    plot_cat_with_amt(df, col)
In [52]:
df.loc[df['id_30'].str.contains('Windows', na=False), 'id_30'] = 'Windows'
df.loc[df['id_30'].str.contains('iOS', na=False), 'id_30'] = 'iOS'
df.loc[df['id_30'].str.contains('Mac OS', na=False), 'id_30'] = 'Mac'
df.loc[df['id_30'].str.contains('Android', na=False), 'id_30'] = 'Android'
df['id_30'].fillna("NAN", inplace=True)

plot_cat_with_amt(df, "id_30")
In [53]:
df.loc[df['id_31'].str.contains('chrome', na=False), 'id_31'] = 'Chrome'
df.loc[df['id_31'].str.contains('firefox', na=False), 'id_31'] = 'Firefox'
df.loc[df['id_31'].str.contains('safari', na=False), 'id_31'] = 'Safari'
df.loc[df['id_31'].str.contains('edge', na=False), 'id_31'] = 'Edge'
df.loc[df['id_31'].str.contains('ie', na=False), 'id_31'] = 'IE'
df.loc[df['id_31'].str.contains('samsung', na=False), 'id_31'] = 'Samsung'
df.loc[df['id_31'].str.contains('opera', na=False), 'id_31'] = 'Opera'
df['id_31'].fillna("NAN", inplace=True)
df.loc[df.id_31.isin(df.id_31.value_counts()[df.id_31.value_counts() < 200].index), 'id_31'] = "Others"
plot_cat_with_amt(df, "id_31")

Get column names

In [54]:
cat_columns = df.select_dtypes(include=['object']).columns
len(cat_columns)
Out[54]:
29
In [55]:
binary_columns = [col for col in df.columns if df[col].nunique() == 2]
len(binary_columns)
Out[55]:
435
In [56]:
num_columns = [col for col in df.columns if (col not in cat_columns) & (col not in binary_columns)]
len(num_columns)
Out[56]:
389
In [57]:
cat_columns = cat_columns.to_list() + binary_columns

7. Statistical Significance test

Chi square test for categorical columns

In [58]:
from   scipy.stats import chi2_contingency
In [59]:
# significance value
alpha = 0.05

significant_categorical_variables = []

for col in cat_columns:  
    # Create a crosstab table
    temp = pd.crosstab(df[col],df['isFraud'].astype('category'))
    
    # Get chi-square value , p-value, degrees of freedom, expected frequencies using the function chi2_contingency
    stat, p, dof, expected = chi2_contingency(temp)
    
    # Determine whether to reject or keep your null hypothesis
    print(col.ljust(40), ',  chisquared=%.5f,   p-value=%.5f' % (stat, p))
    if p <= alpha:
        significant_categorical_variables.append(col)
    else:
        ""
ProductCD                                ,  chisquared=16742.17153,   p-value=0.00000
card4                                    ,  chisquared=364.87414,   p-value=0.00000
card6                                    ,  chisquared=5957.03229,   p-value=0.00000
P_emaildomain                            ,  chisquared=3497.81283,   p-value=0.00000
R_emaildomain                            ,  chisquared=17297.50859,   p-value=0.00000
M1                                       ,  chisquared=0.00003,   p-value=0.99581
M2                                       ,  chisquared=438.61321,   p-value=0.00000
M3                                       ,  chisquared=477.66057,   p-value=0.00000
M4                                       ,  chisquared=6450.44798,   p-value=0.00000
M5                                       ,  chisquared=242.42169,   p-value=0.00000
M6                                       ,  chisquared=227.96414,   p-value=0.00000
M7                                       ,  chisquared=11.25610,   p-value=0.00079
M8                                       ,  chisquared=88.53022,   p-value=0.00000
M9                                       ,  chisquared=250.37250,   p-value=0.00000
id_12                                    ,  chisquared=429.84996,   p-value=0.00000
id_15                                    ,  chisquared=421.13420,   p-value=0.00000
id_16                                    ,  chisquared=366.45700,   p-value=0.00000
id_28                                    ,  chisquared=420.29657,   p-value=0.00000
id_29                                    ,  chisquared=420.22938,   p-value=0.00000
id_30                                    ,  chisquared=176.98386,   p-value=0.00000
id_31                                    ,  chisquared=424.58675,   p-value=0.00000
id_33                                    ,  chisquared=212.82695,   p-value=0.98357
id_34                                    ,  chisquared=11.47415,   p-value=0.00942
id_35                                    ,  chisquared=2.26392,   p-value=0.13242
id_36                                    ,  chisquared=0.02303,   p-value=0.87939
id_37                                    ,  chisquared=1.24550,   p-value=0.26441
id_38                                    ,  chisquared=2.35329,   p-value=0.12502
DeviceType                               ,  chisquared=0.39659,   p-value=0.52885
DeviceInfo                               ,  chisquared=1476.08487,   p-value=1.00000
isFraud                                  ,  chisquared=590510.38453,   p-value=0.00000
M1                                       ,  chisquared=0.00003,   p-value=0.99581
M2                                       ,  chisquared=438.61321,   p-value=0.00000
M3                                       ,  chisquared=477.66057,   p-value=0.00000
M5                                       ,  chisquared=242.42169,   p-value=0.00000
M6                                       ,  chisquared=227.96414,   p-value=0.00000
M7                                       ,  chisquared=11.25610,   p-value=0.00079
M8                                       ,  chisquared=88.53022,   p-value=0.00000
M9                                       ,  chisquared=250.37250,   p-value=0.00000
V1                                       ,  chisquared=0.08480,   p-value=0.77090
V14                                      ,  chisquared=1.85823,   p-value=0.17283
V41                                      ,  chisquared=6.45761,   p-value=0.01105
V65                                      ,  chisquared=2.95009,   p-value=0.08587
V88                                      ,  chisquared=0.06115,   p-value=0.80468
V107                                     ,  chisquared=3.20035,   p-value=0.07362
V305                                     ,  chisquared=0.95990,   p-value=0.32721
id_35                                    ,  chisquared=2.26392,   p-value=0.13242
id_36                                    ,  chisquared=0.02303,   p-value=0.87939
id_37                                    ,  chisquared=1.24550,   p-value=0.26441
id_38                                    ,  chisquared=2.35329,   p-value=0.12502
DeviceType                               ,  chisquared=0.39659,   p-value=0.52885
card2_missing_flag                       ,  chisquared=40.68296,   p-value=0.00000
card3_missing_flag                       ,  chisquared=4.41810,   p-value=0.03556
card4_missing_flag                       ,  chisquared=3.52353,   p-value=0.06050
card5_missing_flag                       ,  chisquared=25.61819,   p-value=0.00000
card6_missing_flag                       ,  chisquared=4.52321,   p-value=0.03344
addr1_missing_flag                       ,  chisquared=15016.72347,   p-value=0.00000
addr2_missing_flag                       ,  chisquared=15016.72347,   p-value=0.00000
dist1_missing_flag                       ,  chisquared=2672.80512,   p-value=0.00000
dist2_missing_flag                       ,  chisquared=4898.54088,   p-value=0.00000
P_emaildomain_missing_flag               ,  chisquared=98.80683,   p-value=0.00000
R_emaildomain_missing_flag               ,  chisquared=11593.85862,   p-value=0.00000
D1_missing_flag                          ,  chisquared=0.02818,   p-value=0.86669
D2_missing_flag                          ,  chisquared=1770.65814,   p-value=0.00000
D3_missing_flag                          ,  chisquared=691.46860,   p-value=0.00000
D4_missing_flag                          ,  chisquared=8.39702,   p-value=0.00376
D5_missing_flag                          ,  chisquared=225.02626,   p-value=0.00000
D6_missing_flag                          ,  chisquared=12282.70285,   p-value=0.00000
D7_missing_flag                          ,  chisquared=15972.26519,   p-value=0.00000
D8_missing_flag                          ,  chisquared=12263.96026,   p-value=0.00000
D9_missing_flag                          ,  chisquared=12263.96026,   p-value=0.00000
D10_missing_flag                         ,  chisquared=671.51048,   p-value=0.00000
D11_missing_flag                         ,  chisquared=4605.06821,   p-value=0.00000
D12_missing_flag                         ,  chisquared=14617.29683,   p-value=0.00000
D13_missing_flag                         ,  chisquared=11641.55968,   p-value=0.00000
D14_missing_flag                         ,  chisquared=13502.69648,   p-value=0.00000
D15_missing_flag                         ,  chisquared=524.34621,   p-value=0.00000
M1_missing_flag                          ,  chisquared=4720.57858,   p-value=0.00000
M2_missing_flag                          ,  chisquared=4720.57858,   p-value=0.00000
M3_missing_flag                          ,  chisquared=4720.57858,   p-value=0.00000
M4_missing_flag                          ,  chisquared=4291.55730,   p-value=0.00000
M5_missing_flag                          ,  chisquared=143.24698,   p-value=0.00000
M6_missing_flag                          ,  chisquared=8958.36720,   p-value=0.00000
M7_missing_flag                          ,  chisquared=2876.27247,   p-value=0.00000
M8_missing_flag                          ,  chisquared=2876.92899,   p-value=0.00000
M9_missing_flag                          ,  chisquared=2876.92899,   p-value=0.00000
V1_missing_flag                          ,  chisquared=4605.06821,   p-value=0.00000
V2_missing_flag                          ,  chisquared=4605.06821,   p-value=0.00000
V3_missing_flag                          ,  chisquared=4605.06821,   p-value=0.00000
V4_missing_flag                          ,  chisquared=4605.06821,   p-value=0.00000
V5_missing_flag                          ,  chisquared=4605.06821,   p-value=0.00000
V6_missing_flag                          ,  chisquared=4605.06821,   p-value=0.00000
V7_missing_flag                          ,  chisquared=4605.06821,   p-value=0.00000
V8_missing_flag                          ,  chisquared=4605.06821,   p-value=0.00000
V9_missing_flag                          ,  chisquared=4605.06821,   p-value=0.00000
V10_missing_flag                         ,  chisquared=4605.06821,   p-value=0.00000
V11_missing_flag                         ,  chisquared=4605.06821,   p-value=0.00000
V12_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V13_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V14_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V15_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V16_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V17_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V18_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V19_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V20_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V21_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V22_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V23_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V24_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V25_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V26_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V27_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V28_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V29_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V30_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V31_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V32_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V33_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V34_missing_flag                         ,  chisquared=670.26786,   p-value=0.00000
V35_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V36_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V37_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V38_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V39_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V40_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V41_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V42_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V43_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V44_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V45_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V46_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V47_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V48_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V49_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V50_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V51_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V52_missing_flag                         ,  chisquared=8.24695,   p-value=0.00408
V53_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V54_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V55_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V56_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V57_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V58_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V59_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V60_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V61_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V62_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V63_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V64_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V65_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V66_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V67_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V68_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V69_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V70_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V71_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V72_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V73_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V74_missing_flag                         ,  chisquared=1448.91324,   p-value=0.00000
V75_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V76_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V77_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V78_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V79_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V80_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V81_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V82_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V83_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V84_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V85_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V86_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V87_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V88_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V89_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V90_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V91_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V92_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V93_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V94_missing_flag                         ,  chisquared=522.48477,   p-value=0.00000
V95_missing_flag                         ,  chisquared=2.86829,   p-value=0.09034
V96_missing_flag                         ,  chisquared=2.86829,   p-value=0.09034
V97_missing_flag                         ,  chisquared=2.86829,   p-value=0.09034
V98_missing_flag                         ,  chisquared=2.86829,   p-value=0.09034
V99_missing_flag                         ,  chisquared=2.86829,   p-value=0.09034
V100_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V101_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V102_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V103_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V104_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V105_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V106_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V107_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V108_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V109_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V110_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V111_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V112_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V113_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V114_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V115_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V116_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V117_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V118_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V119_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V120_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V121_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V122_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V123_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V124_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V125_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V126_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V127_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V128_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V129_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V130_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V131_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V132_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V133_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V134_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V135_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V136_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V137_missing_flag                        ,  chisquared=2.86829,   p-value=0.09034
V138_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V139_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V140_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V141_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V142_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V143_missing_flag                        ,  chisquared=257.28425,   p-value=0.00000
V144_missing_flag                        ,  chisquared=257.28425,   p-value=0.00000
V145_missing_flag                        ,  chisquared=257.28425,   p-value=0.00000
V146_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V147_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V148_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V149_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V150_missing_flag                        ,  chisquared=257.28425,   p-value=0.00000
V151_missing_flag                        ,  chisquared=257.28425,   p-value=0.00000
V152_missing_flag                        ,  chisquared=257.28425,   p-value=0.00000
V153_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V154_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V155_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V156_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V157_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V158_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V159_missing_flag                        ,  chisquared=257.28425,   p-value=0.00000
V160_missing_flag                        ,  chisquared=257.28425,   p-value=0.00000
V161_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V162_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V163_missing_flag                        ,  chisquared=256.78110,   p-value=0.00000
V164_missing_flag                        ,  chisquared=257.28425,   p-value=0.00000
V165_missing_flag                        ,  chisquared=257.28425,   p-value=0.00000
V166_missing_flag                        ,  chisquared=257.28425,   p-value=0.00000
V167_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V168_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V169_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V170_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V171_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V172_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V173_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V174_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V175_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V176_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V177_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V178_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V179_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V180_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V181_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V182_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V183_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V184_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V185_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V186_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V187_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V188_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V189_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V190_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V191_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V192_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V193_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V194_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V195_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V196_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V197_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V198_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V199_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V200_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V201_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V202_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V203_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V204_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V205_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V206_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V207_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V208_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V209_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V210_missing_flag                        ,  chisquared=10589.78839,   p-value=0.00000
V211_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V212_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V213_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V214_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V215_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V216_missing_flag                        ,  chisquared=10515.97912,   p-value=0.00000
V217_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V218_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V219_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V220_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V221_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V222_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V223_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V224_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V225_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V226_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V227_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V228_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V229_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V230_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V231_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V232_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V233_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V234_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V235_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V236_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V237_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V238_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V239_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V240_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V241_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V242_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V243_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V244_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V245_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V246_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V247_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V248_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V249_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V250_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V251_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V252_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V253_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V254_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V255_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V256_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V257_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V258_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V259_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V260_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V261_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V262_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V263_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V264_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V265_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V266_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V267_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V268_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V269_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V270_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V271_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V272_missing_flag                        ,  chisquared=10023.65463,   p-value=0.00000
V273_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V274_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V275_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V276_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V277_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V278_missing_flag                        ,  chisquared=9230.05652,   p-value=0.00000
V279_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V280_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V281_missing_flag                        ,  chisquared=0.02818,   p-value=0.86669
V282_missing_flag                        ,  chisquared=0.02818,   p-value=0.86669
V283_missing_flag                        ,  chisquared=0.02818,   p-value=0.86669
V284_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V285_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V286_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V287_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V288_missing_flag                        ,  chisquared=0.02818,   p-value=0.86669
V289_missing_flag                        ,  chisquared=0.02818,   p-value=0.86669
V290_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V291_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V292_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V293_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V294_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V295_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V296_missing_flag                        ,  chisquared=0.02818,   p-value=0.86669
V297_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V298_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V299_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V300_missing_flag                        ,  chisquared=0.02818,   p-value=0.86669
V301_missing_flag                        ,  chisquared=0.02818,   p-value=0.86669
V302_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V303_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V304_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V305_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V306_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V307_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V308_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V309_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V310_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V311_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V312_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V313_missing_flag                        ,  chisquared=0.02818,   p-value=0.86669
V314_missing_flag                        ,  chisquared=0.02818,   p-value=0.86669
V315_missing_flag                        ,  chisquared=0.02818,   p-value=0.86669
V316_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V317_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V318_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V319_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V320_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V321_missing_flag                        ,  chisquared=2.87936,   p-value=0.08972
V322_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V323_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V324_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V325_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V326_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V327_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V328_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V329_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V330_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V331_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V332_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V333_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V334_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V335_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V336_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V337_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V338_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
V339_missing_flag                        ,  chisquared=270.16694,   p-value=0.00000
id_01_missing_flag                       ,  chisquared=429.43113,   p-value=0.00000
id_02_missing_flag                       ,  chisquared=418.79840,   p-value=0.00000
id_03_missing_flag                       ,  chisquared=155.60340,   p-value=0.00000
id_04_missing_flag                       ,  chisquared=155.60340,   p-value=0.00000
id_05_missing_flag                       ,  chisquared=410.62370,   p-value=0.00000
id_06_missing_flag                       ,  chisquared=410.62370,   p-value=0.00000
id_07_missing_flag                       ,  chisquared=11.67033,   p-value=0.00064
id_08_missing_flag                       ,  chisquared=11.67033,   p-value=0.00064
id_09_missing_flag                       ,  chisquared=186.68294,   p-value=0.00000
id_10_missing_flag                       ,  chisquared=186.68294,   p-value=0.00000
id_11_missing_flag                       ,  chisquared=419.74461,   p-value=0.00000
id_12_missing_flag                       ,  chisquared=429.43113,   p-value=0.00000
id_13_missing_flag                       ,  chisquared=372.26945,   p-value=0.00000
id_14_missing_flag                       ,  chisquared=186.57626,   p-value=0.00000
id_15_missing_flag                       ,  chisquared=419.89703,   p-value=0.00000
id_16_missing_flag                       ,  chisquared=365.88234,   p-value=0.00000
id_17_missing_flag                       ,  chisquared=421.50054,   p-value=0.00000
id_18_missing_flag                       ,  chisquared=108.11199,   p-value=0.00000
id_19_missing_flag                       ,  chisquared=421.07000,   p-value=0.00000
id_20_missing_flag                       ,  chisquared=419.82460,   p-value=0.00000
id_21_missing_flag                       ,  chisquared=11.73422,   p-value=0.00061
id_22_missing_flag                       ,  chisquared=11.89450,   p-value=0.00056
id_23_missing_flag                       ,  chisquared=11.89450,   p-value=0.00056
id_24_missing_flag                       ,  chisquared=9.86160,   p-value=0.00169
id_25_missing_flag                       ,  chisquared=12.35494,   p-value=0.00044
id_26_missing_flag                       ,  chisquared=11.79823,   p-value=0.00059
id_27_missing_flag                       ,  chisquared=11.89450,   p-value=0.00056
id_28_missing_flag                       ,  chisquared=419.74461,   p-value=0.00000
id_29_missing_flag                       ,  chisquared=419.74461,   p-value=0.00000
id_30_missing_flag                       ,  chisquared=175.29045,   p-value=0.00000
id_31_missing_flag                       ,  chisquared=417.52693,   p-value=0.00000
id_32_missing_flag                       ,  chisquared=175.65824,   p-value=0.00000
id_33_missing_flag                       ,  chisquared=171.60347,   p-value=0.00000
id_34_missing_flag                       ,  chisquared=180.63611,   p-value=0.00000
id_35_missing_flag                       ,  chisquared=419.89703,   p-value=0.00000
id_36_missing_flag                       ,  chisquared=419.89703,   p-value=0.00000
id_37_missing_flag                       ,  chisquared=419.89703,   p-value=0.00000
id_38_missing_flag                       ,  chisquared=419.89703,   p-value=0.00000
DeviceType_missing_flag                  ,  chisquared=417.45054,   p-value=0.00000
DeviceInfo_missing_flag                  ,  chisquared=317.74598,   p-value=0.00000
In [62]:
# Significant variables
# print(significant_categorical_variables)

Calculate odds

Chi-Square test tells if the entire variable is useful or not.

In [67]:
ctab = pd.crosstab(df['ProductCD'], df['isFraud'].astype('category'))
ctab
Out[67]:
isFraud 0 1
ProductCD
C 60511 8008
H 31450 1574
R 36273 1426
S 10942 686
W 430701 8969

Odds

In [68]:
ctab.columns = ctab.columns.add_categories('odds')
ctab['odds'] = ctab[1]/ctab[0]
ctab
Out[68]:
isFraud 0 1 odds
ProductCD
C 60511 8008 0.132340
H 31450 1574 0.050048
R 36273 1426 0.039313
S 10942 686 0.062694
W 430701 8969 0.020824

Odds Ratio

In [69]:
ctab.columns = ctab.columns.add_categories('odds_ratio')
ctab['odds_ratio'] = ctab['odds'] / (ctab[1].sum()/ctab[0].sum())
ctab
Out[69]:
isFraud 0 1 odds odds_ratio
ProductCD
C 60511 8008 0.132340 3.649871
H 31450 1574 0.050048 1.380295
R 36273 1426 0.039313 1.084236
S 10942 686 0.062694 1.729080
W 430701 8969 0.020824 0.574323

Highers odds ratio implies more chance of fraud in that category.

Farther away it is from 1.0 (both directions) more important the variable is.

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

8. ANOVA Test

In [ ]:
from scipy.stats import f_oneway
In [74]:
# significance value
alpha = 0.05

significant_numerical_variables = []
for col in num_columns[2:]:
    # Determine whether to reject or keep your null hypothesis
    if df.loc[:, col].nunique() > 50:
        F, p = f_oneway(df[df.isFraud == 1][col].dropna(),
                    df[df.isFraud == 0][col].dropna())
        print(col.ljust(40), ',   F-statistic=%.5f, p=%.5f' % (F, p), df.loc[:, col].nunique())
        if p <= alpha:
            significant_numerical_variables.append(col)
TransactionAmt                           ,   F-statistic=75.67718, p=0.00000 8195
card1                                    ,   F-statistic=109.88932, p=0.00000 13553
card2                                    ,   F-statistic=6.67558, p=0.00977 500
card3                                    ,   F-statistic=14336.21578, p=0.00000 114
card5                                    ,   F-statistic=661.82659, p=0.00000 119
addr1                                    ,   F-statistic=16.43435, p=0.00005 332
addr2                                    ,   F-statistic=485.06660, p=0.00000 74
dist1                                    ,   F-statistic=110.42966, p=0.00000 2412
C1                                       ,   F-statistic=552.37693, p=0.00000 1495
C2                                       ,   F-statistic=819.61843, p=0.00000 1167
C4                                       ,   F-statistic=545.60992, p=0.00000 1223
C5                                       ,   F-statistic=559.06343, p=0.00000 319
C6                                       ,   F-statistic=258.28478, p=0.00000 1291
C7                                       ,   F-statistic=468.69042, p=0.00000 1069
C8                                       ,   F-statistic=610.57329, p=0.00000 1130
C9                                       ,   F-statistic=594.15081, p=0.00000 205
C10                                      ,   F-statistic=476.56495, p=0.00000 1122
C11                                      ,   F-statistic=446.39851, p=0.00000 1343
C12                                      ,   F-statistic=601.74057, p=0.00000 1066
C13                                      ,   F-statistic=73.37132, p=0.00000 1464
C14                                      ,   F-statistic=37.04987, p=0.00000 1108
D1                                       ,   F-statistic=2672.56121, p=0.00000 641
D2                                       ,   F-statistic=2179.12220, p=0.00000 641
D3                                       ,   F-statistic=703.03864, p=0.00000 649
D4                                       ,   F-statistic=1913.51964, p=0.00000 808
D5                                       ,   F-statistic=1177.67893, p=0.00000 688
D6                                       ,   F-statistic=240.53556, p=0.00000 829
D8                                       ,   F-statistic=1555.93433, p=0.00000 5367
D10                                      ,   F-statistic=2681.29292, p=0.00000 818
D11                                      ,   F-statistic=634.20722, p=0.00000 676
D12                                      ,   F-statistic=53.96035, p=0.00000 635
D13                                      ,   F-statistic=219.58119, p=0.00000 577
D14                                      ,   F-statistic=4.66673, p=0.03076 802
D15                                      ,   F-statistic=3031.40845, p=0.00000 859
V37                                      ,   F-statistic=13626.03770, p=0.00000 55
V38                                      ,   F-statistic=17383.75607, p=0.00000 55
V56                                      ,   F-statistic=1940.19590, p=0.00000 52
V95                                      ,   F-statistic=10.01732, p=0.00155 881
V96                                      ,   F-statistic=17.75408, p=0.00003 1410
V97                                      ,   F-statistic=11.90221, p=0.00056 976
V99                                      ,   F-statistic=102.52878, p=0.00000 89
V101                                     ,   F-statistic=13.10561, p=0.00029 870
V102                                     ,   F-statistic=13.87026, p=0.00020 1285
V103                                     ,   F-statistic=15.47712, p=0.00008 928
V105                                     ,   F-statistic=6.15509, p=0.01310 100
V106                                     ,   F-statistic=2.58112, p=0.10815 56
V126                                     ,   F-statistic=7.41985, p=0.00645 10299
V127                                     ,   F-statistic=4.07834, p=0.04344 24414
V128                                     ,   F-statistic=2.19911, p=0.13809 14507
V129                                     ,   F-statistic=95.66082, p=0.00000 1608
V130                                     ,   F-statistic=5.58774, p=0.01809 5511
V131                                     ,   F-statistic=368.35409, p=0.00000 3097
V132                                     ,   F-statistic=10.72701, p=0.00106 6560
V133                                     ,   F-statistic=5.91363, p=0.01502 9949
V134                                     ,   F-statistic=7.27959, p=0.00697 8178
V135                                     ,   F-statistic=0.04505, p=0.83192 3724
V136                                     ,   F-statistic=0.00105, p=0.97410 4852
V137                                     ,   F-statistic=0.00778, p=0.92972 4252
V143                                     ,   F-statistic=72.67628, p=0.00000 870
V144                                     ,   F-statistic=221.93537, p=0.00000 63
V145                                     ,   F-statistic=315.85937, p=0.00000 260
V150                                     ,   F-statistic=336.66794, p=0.00000 1344
V151                                     ,   F-statistic=304.74317, p=0.00000 56
V159                                     ,   F-statistic=323.02425, p=0.00000 2492
V160                                     ,   F-statistic=319.70138, p=0.00000 9621
V161                                     ,   F-statistic=130.04611, p=0.00000 79
V162                                     ,   F-statistic=233.47479, p=0.00000 185
V163                                     ,   F-statistic=177.81176, p=0.00000 106
V164                                     ,   F-statistic=61.77344, p=0.00000 1978
V165                                     ,   F-statistic=217.06800, p=0.00000 2547
V166                                     ,   F-statistic=71.06800, p=0.00000 987
V167                                     ,   F-statistic=23.07236, p=0.00000 873
V168                                     ,   F-statistic=35.84863, p=0.00000 965
V171                                     ,   F-statistic=6876.38875, p=0.00000 62
V177                                     ,   F-statistic=26.27124, p=0.00000 862
V178                                     ,   F-statistic=38.22487, p=0.00000 1236
V179                                     ,   F-statistic=35.32729, p=0.00000 921
V180                                     ,   F-statistic=12.19456, p=0.00048 84
V182                                     ,   F-statistic=14.26650, p=0.00016 84
V187                                     ,   F-statistic=82.01146, p=0.00000 215
V201                                     ,   F-statistic=16856.02855, p=0.00000 56
V202                                     ,   F-statistic=45.39293, p=0.00000 10970
V203                                     ,   F-statistic=55.54978, p=0.00000 14951
V204                                     ,   F-statistic=60.13305, p=0.00000 12858
V205                                     ,   F-statistic=13.67504, p=0.00022 1953
V206                                     ,   F-statistic=4.52806, p=0.03335 1581
V207                                     ,   F-statistic=21.53123, p=0.00000 2705
V208                                     ,   F-statistic=11.70640, p=0.00062 2093
V209                                     ,   F-statistic=49.58723, p=0.00000 2674
V210                                     ,   F-statistic=0.05960, p=0.80713 2262
V211                                     ,   F-statistic=48.17990, p=0.00000 7624
V212                                     ,   F-statistic=62.41706, p=0.00000 8868
V213                                     ,   F-statistic=59.28563, p=0.00000 8317
V214                                     ,   F-statistic=3.28560, p=0.06989 2282
V215                                     ,   F-statistic=3.21830, p=0.07282 2747
V216                                     ,   F-statistic=0.05228, p=0.81914 2532
V217                                     ,   F-statistic=273.98881, p=0.00000 304
V218                                     ,   F-statistic=295.75410, p=0.00000 401
V219                                     ,   F-statistic=277.07672, p=0.00000 379
V221                                     ,   F-statistic=2528.82003, p=0.00000 77
V222                                     ,   F-statistic=4026.25520, p=0.00000 76
V224                                     ,   F-statistic=0.17276, p=0.67767 79
V226                                     ,   F-statistic=0.67393, p=0.41169 81
V228                                     ,   F-statistic=10162.77697, p=0.00000 55
V229                                     ,   F-statistic=2503.20448, p=0.00000 91
V230                                     ,   F-statistic=7401.92756, p=0.00000 66
V231                                     ,   F-statistic=222.73487, p=0.00000 294
V232                                     ,   F-statistic=390.01638, p=0.00000 338
V233                                     ,   F-statistic=279.30291, p=0.00000 333
V234                                     ,   F-statistic=53.01029, p=0.00000 122
V245                                     ,   F-statistic=1189.20909, p=0.00000 58
V253                                     ,   F-statistic=50.39006, p=0.00000 66
V258                                     ,   F-statistic=12632.06226, p=0.00000 67
V259                                     ,   F-statistic=3328.22023, p=0.00000 68
V263                                     ,   F-statistic=28.30710, p=0.00000 10422
V264                                     ,   F-statistic=11.42220, p=0.00073 13358
V265                                     ,   F-statistic=26.39408, p=0.00000 11757
V266                                     ,   F-statistic=3.60378, p=0.05765 1871
V267                                     ,   F-statistic=7.12289, p=0.00761 2884
V268                                     ,   F-statistic=3.47409, p=0.06234 2286
V269                                     ,   F-statistic=4.54627, p=0.03299 151
V270                                     ,   F-statistic=34.78841, p=0.00000 1972
V271                                     ,   F-statistic=105.06963, p=0.00000 2286
V272                                     ,   F-statistic=91.15867, p=0.00000 2082
V273                                     ,   F-statistic=32.05464, p=0.00000 4689
V274                                     ,   F-statistic=31.41484, p=0.00000 8315
V275                                     ,   F-statistic=38.23804, p=0.00000 4965
V276                                     ,   F-statistic=12.22436, p=0.00047 2263
V277                                     ,   F-statistic=15.03960, p=0.00011 2540
V278                                     ,   F-statistic=13.14749, p=0.00029 2398
V279                                     ,   F-statistic=8.30948, p=0.00394 881
V280                                     ,   F-statistic=0.36129, p=0.54779 975
V283                                     ,   F-statistic=7585.01985, p=0.00000 62
V285                                     ,   F-statistic=49.20336, p=0.00000 96
V290                                     ,   F-statistic=953.42763, p=0.00000 58
V291                                     ,   F-statistic=246.83555, p=0.00000 219
V292                                     ,   F-statistic=423.76303, p=0.00000 173
V293                                     ,   F-statistic=11.96267, p=0.00054 870
V294                                     ,   F-statistic=10.61896, p=0.00112 1286
V295                                     ,   F-statistic=1.66959, p=0.19631 928
V296                                     ,   F-statistic=10.20032, p=0.00140 94
V298                                     ,   F-statistic=0.98916, p=0.31995 94
V306                                     ,   F-statistic=2.04992, p=0.15222 16210
V307                                     ,   F-statistic=20.89358, p=0.00000 37367
V308                                     ,   F-statistic=6.25009, p=0.01242 23064
V309                                     ,   F-statistic=242.35559, p=0.00000 3239
V310                                     ,   F-statistic=72.36754, p=0.00000 7759
V311                                     ,   F-statistic=0.99820, p=0.31775 2526
V312                                     ,   F-statistic=835.01886, p=0.00000 5143
V313                                     ,   F-statistic=1016.37502, p=0.00000 3915
V314                                     ,   F-statistic=876.33173, p=0.00000 5974
V315                                     ,   F-statistic=1377.84553, p=0.00000 4540
V316                                     ,   F-statistic=5.17498, p=0.02291 9814
V317                                     ,   F-statistic=14.82121, p=0.00012 15184
V318                                     ,   F-statistic=0.58715, p=0.44352 12309
V319                                     ,   F-statistic=0.00218, p=0.96276 4799
V320                                     ,   F-statistic=14.53340, p=0.00014 6439
V321                                     ,   F-statistic=1.66098, p=0.19747 5560
V322                                     ,   F-statistic=38.22813, p=0.00000 881
V323                                     ,   F-statistic=44.84077, p=0.00000 1411
V324                                     ,   F-statistic=47.48458, p=0.00000 976
V329                                     ,   F-statistic=43.96148, p=0.00000 100
V330                                     ,   F-statistic=36.90161, p=0.00000 56
V331                                     ,   F-statistic=39.81114, p=0.00000 1758
V332                                     ,   F-statistic=45.37959, p=0.00000 2453
V333                                     ,   F-statistic=47.99171, p=0.00000 1971
V334                                     ,   F-statistic=0.01673, p=0.89708 143
V335                                     ,   F-statistic=2.44629, p=0.11781 669
V336                                     ,   F-statistic=0.47508, p=0.49066 355
V337                                     ,   F-statistic=2.67789, p=0.10175 254
V338                                     ,   F-statistic=30.86541, p=0.00000 380
V339                                     ,   F-statistic=17.71005, p=0.00003 334
id_01                                    ,   F-statistic=0.21069, p=0.64623 77
id_02                                    ,   F-statistic=5.40582, p=0.02007 115655
id_05                                    ,   F-statistic=0.62754, p=0.42826 93
id_06                                    ,   F-statistic=0.32488, p=0.56869 101
id_10                                    ,   F-statistic=0.15965, p=0.68948 62
id_11                                    ,   F-statistic=0.74750, p=0.38727 146
id_13                                    ,   F-statistic=1.07632, p=0.29952 54
id_17                                    ,   F-statistic=1.92097, p=0.16575 104
id_19                                    ,   F-statistic=0.09828, p=0.75391 522
id_20                                    ,   F-statistic=0.34673, p=0.55597 394
Date                                     ,   F-statistic=101.40691, p=0.00000 573349
In [57]:
# Significant variables
# significant_numerical_variables

EDA Inferences:

  • The target class in imbalanced
  • Only 3.5% transactions are fraud in terms of count and 3.87% in terms of transaction amount
  • TransactionAmt is right skewed so log transform needs to be used to make it normally distributed
  • Fraud Transaction rate is maximum for Product Category C and minimum for Product Category W
  • 97% of transactions are from Mastercard(32%) and Visa(65%
  • Fraud transaction rate is highest for discover cards(~8%) against ~3.5% of Mastercard and Visa and 2.87% in American Express
  • Almost all the transactions are from Credit and Debit cards.
  • Debit card transactions are almost 3 times as compared to credit card transactions.
  • Fraud transaction rate is high for Credit cards as compared to Debit cards.
  • Fraud transaction rate for Microsoft is high as compared to Google and Yahoo mail # p emaildomain
  • Fraud transaction rate (amount) for Google is high as comapred to Microsoft and Yahoo mail #p emaildomain
  • There isn't any information about R_emaildomain for Majority of transactions (76.75% count , 85.62% amount) #r emaildomain
  • Fraud transaction rate for Google is high as compared to Yahoo, anaonymous.com and Microsoft #r emaildomain
  • Surprisingly fraud transaction rate is high on the days when number of transactions are less
  • Day 29,30 and 31 are having less transactions, looks like people are broke at the month end
  • Surprisingly fraud transaction rate is high on the days when number of transactions and transaction amounts are less. Day 0 and 6
  • Day 0 and 6 have less transactions, these might be weekend days
  • Transactions start decreasing mid night but the fraud rate starts increasing
  • Transactions from 3 AM to 12 PM needs to monitored very closely

Mini Challenge

  • Analyze the columns having majority of missing values. Create a new column which has a flag whether the value is missing or not, later on try to find out some pattern of missing data with the target class

7. Feature Engineering

Feature engineering is the process of using domain and statistical knowledge to extract features from raw data via data mining techniques.

These features often help to improve the performance of machine learning models.

The goal of this section is to:

  • Engineer domain specific features
  • Dimensionality reduction
  • Encode the categorical features
In [58]:
df.head()
Out[58]:
TransactionID isFraud TransactionDT TransactionAmt ProductCD card1 card2 card3 card4 card5 card6 addr1 addr2 dist1 P_emaildomain R_emaildomain C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 D1 D2 D3 D4 D5 D6 D8 D9 D10 D11 D12 D13 D14 D15 M1 M2 M3 M4 M5 M6 M7 M8 M9 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71 V72 V73 V74 V75 V76 V77 V78 V79 V80 V81 V82 V83 V84 V85 V86 V87 V88 V89 V90 V91 V92 V93 V94 V95 V96 V97 V98 V99 V100 V101 V102 V103 V104 V105 V106 V107 V108 V109 V110 V111 V112 V113 V114 V115 V116 V117 V118 V119 V120 V121 V122 V123 V124 V125 V126 V127 V128 V129 V130 V131 V132 V133 V134 V135 V136 V137 V138 V139 V140 V141 V142 V143 V144 V145 V146 V147 V148 V149 V150 V151 V152 V153 V154 V155 V156 V157 V158 V159 V160 V161 V162 V163 V164 V165 V166 V167 V168 V169 V170 V171 V172 V173 V174 V175 V176 V177 V178 V179 V180 V181 V182 V183 V184 V185 V186 V187 V188 V189 V190 V191 V192 V193 V194 V195 V196 V197 ... V134_missing_flag V135_missing_flag V136_missing_flag V137_missing_flag V138_missing_flag V139_missing_flag V140_missing_flag V141_missing_flag V142_missing_flag V143_missing_flag V144_missing_flag V145_missing_flag V146_missing_flag V147_missing_flag V148_missing_flag V149_missing_flag V150_missing_flag V151_missing_flag V152_missing_flag V153_missing_flag V154_missing_flag V155_missing_flag V156_missing_flag V157_missing_flag V158_missing_flag V159_missing_flag V160_missing_flag V161_missing_flag V162_missing_flag V163_missing_flag V164_missing_flag V165_missing_flag V166_missing_flag V167_missing_flag V168_missing_flag V169_missing_flag V170_missing_flag V171_missing_flag V172_missing_flag V173_missing_flag V174_missing_flag V175_missing_flag V176_missing_flag V177_missing_flag V178_missing_flag V179_missing_flag V180_missing_flag V181_missing_flag V182_missing_flag V183_missing_flag V184_missing_flag V185_missing_flag V186_missing_flag V187_missing_flag V188_missing_flag V189_missing_flag V190_missing_flag V191_missing_flag V192_missing_flag V193_missing_flag V194_missing_flag V195_missing_flag V196_missing_flag V197_missing_flag V198_missing_flag V199_missing_flag V200_missing_flag V201_missing_flag V202_missing_flag V203_missing_flag V204_missing_flag V205_missing_flag V206_missing_flag V207_missing_flag V208_missing_flag V209_missing_flag V210_missing_flag V211_missing_flag V212_missing_flag V213_missing_flag V214_missing_flag V215_missing_flag V216_missing_flag V217_missing_flag V218_missing_flag V219_missing_flag V220_missing_flag V221_missing_flag V222_missing_flag V223_missing_flag V224_missing_flag V225_missing_flag V226_missing_flag V227_missing_flag V228_missing_flag V229_missing_flag V230_missing_flag V231_missing_flag V232_missing_flag V233_missing_flag V234_missing_flag V235_missing_flag V236_missing_flag V237_missing_flag V238_missing_flag V239_missing_flag V240_missing_flag V241_missing_flag V242_missing_flag V243_missing_flag V244_missing_flag V245_missing_flag V246_missing_flag V247_missing_flag V248_missing_flag V249_missing_flag V250_missing_flag V251_missing_flag V252_missing_flag V253_missing_flag V254_missing_flag V255_missing_flag V256_missing_flag V257_missing_flag V258_missing_flag V259_missing_flag V260_missing_flag V261_missing_flag V262_missing_flag V263_missing_flag V264_missing_flag V265_missing_flag V266_missing_flag V267_missing_flag V268_missing_flag V269_missing_flag V270_missing_flag V271_missing_flag V272_missing_flag V273_missing_flag V274_missing_flag V275_missing_flag V276_missing_flag V277_missing_flag V278_missing_flag V279_missing_flag V280_missing_flag V281_missing_flag V282_missing_flag V283_missing_flag V284_missing_flag V285_missing_flag V286_missing_flag V287_missing_flag V288_missing_flag V289_missing_flag V290_missing_flag V291_missing_flag V292_missing_flag V293_missing_flag V294_missing_flag V295_missing_flag V296_missing_flag V297_missing_flag V298_missing_flag V299_missing_flag V300_missing_flag V301_missing_flag V302_missing_flag V303_missing_flag V304_missing_flag V305_missing_flag V306_missing_flag V307_missing_flag V308_missing_flag V309_missing_flag V310_missing_flag V311_missing_flag V312_missing_flag V313_missing_flag V314_missing_flag V315_missing_flag V316_missing_flag V317_missing_flag V318_missing_flag V319_missing_flag V320_missing_flag V321_missing_flag V322_missing_flag V323_missing_flag V324_missing_flag V325_missing_flag V326_missing_flag V327_missing_flag V328_missing_flag V329_missing_flag V330_missing_flag V331_missing_flag V332_missing_flag V333_missing_flag V334_missing_flag V335_missing_flag V336_missing_flag V337_missing_flag V338_missing_flag V339_missing_flag id_01_missing_flag id_02_missing_flag id_03_missing_flag id_04_missing_flag id_05_missing_flag id_06_missing_flag id_07_missing_flag id_08_missing_flag id_09_missing_flag id_10_missing_flag id_11_missing_flag id_12_missing_flag id_13_missing_flag id_14_missing_flag id_15_missing_flag id_16_missing_flag id_17_missing_flag id_18_missing_flag id_19_missing_flag id_20_missing_flag id_21_missing_flag id_22_missing_flag id_23_missing_flag id_24_missing_flag id_25_missing_flag id_26_missing_flag id_27_missing_flag id_28_missing_flag id_29_missing_flag id_30_missing_flag id_31_missing_flag id_32_missing_flag id_33_missing_flag id_34_missing_flag id_35_missing_flag id_36_missing_flag id_37_missing_flag id_38_missing_flag DeviceType_missing_flag DeviceInfo_missing_flag Date _Weekdays _Hours _Days
0 2987000 0 86400 68.5 W 13926 NaN 150.0 discover 142.0 credit 315.0 87.0 19.0 NoInf NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 2.0 0.0 1.0 1.0 14.0 NaN 13.0 NaN NaN NaN NaN NaN 13.0 13.0 NaN NaN NaN 0.0 T T T M2 F T NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 117.0 0.0 0.0 0.0 0.0 0.0 117.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... False False False False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True True True True True True True False False True False False False False True False False True True True True True True True False False False False False False False False False False False False False 2017-12-02 00:00:00 5 0 2
1 2987001 0 86401 29.0 W 2755 404.0 150.0 mastercard 102.0 credit 325.0 87.0 NaN Google NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN NaN M0 T T NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... False False False False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True False False True True True True False False False False False False False True False False True True True True True True True False False False False False False False False False False False False False 2017-12-02 00:00:01 5 0 2
2 2987002 0 86469 59.0 W 4663 490.0 150.0 visa 166.0 debit 330.0 87.0 287.0 Microsoft NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN 0.0 315.0 NaN NaN NaN 315.0 T T T M0 F F F F F 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... False False False False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False False False False False True True False False False False False True False False False True False False True True True True True True True False False True False True True True False False False False False False 2017-12-02 00:01:09 5 0 2
3 2987003 0 86499 50.0 W 18132 567.0 150.0 mastercard 117.0 debit 476.0 87.0 NaN Yahoo Mail NoInf 2.0 5.0 0.0 0.0 0.0 4.0 0.0 0.0 1.0 0.0 1.0 0.0 25.0 1.0 112.0 112.0 0.0 94.0 0.0 NaN NaN NaN 84.0 NaN NaN NaN NaN 111.0 NaN NaN NaN M0 T F NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 48.0 28.0 0.0 10.0 4.0 1.0 38.0 24.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 50.0 1758.0 925.0 0.0 354.0 135.0 50.0 1404.0 790.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... False False False False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True False False True True True True False False False True False False False True False False True True True True True True True False False True False True True True False False False False False True 2017-12-02 00:01:39 5 0 2
4 2987004 0 86506 50.0 H 4497 514.0 150.0 mastercard 102.0 credit 420.0 87.0 NaN Google NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 18.0 140.0 0.0 0.0 0.0 0.0 1803.0 49.0 64.0 0.0 0.0 0.0 0.0 0.0 0.0 15560.0 169690.796875 0.0 0.0 0.0 515.0 5155.0 2840.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True False False False False True False False False False False False False True True True True True True True False False False False False False False False False False False False False 2017-12-02 00:01:46 5 0 2

5 rows × 840 columns

Domain Specific Features

You need to engineer the domain specific features. This might boost up the predictive power. This often gives better performing models

Domain knowledge is one of the key pillars of data science. So always understand the domain before attempting the problem.

In [59]:
# Transaction amount minus mean of transaction 
df['Trans_min_mean'] = df['TransactionAmt'] - np.nanmean(df['TransactionAmt'],dtype="float64")
df['Trans_min_std']  = df['Trans_min_mean'] / np.nanstd(df['TransactionAmt'].astype("float64"),dtype="float64")

Replace value by the group's mean (or standard dev)

In [60]:
# Features for transaction amount and card 
df['TransactionAmt_to_mean_card1'] = df['TransactionAmt'] / df.groupby(['card1'])['TransactionAmt'].transform('mean')
df['TransactionAmt_to_mean_card4'] = df['TransactionAmt'] / df.groupby(['card4'])['TransactionAmt'].transform('mean')
df['TransactionAmt_to_std_card1']  = df['TransactionAmt'] / df.groupby(['card1'])['TransactionAmt'].transform('std')
df['TransactionAmt_to_std_card4']  = df['TransactionAmt'] / df.groupby(['card4'])['TransactionAmt'].transform('std')
In [62]:
# Log of transaction amount
df['TransactionAmt'] = np.log(df['TransactionAmt'])
In [63]:
df.head()
Out[63]:
TransactionID isFraud TransactionDT TransactionAmt ProductCD card1 card2 card3 card4 card5 card6 addr1 addr2 dist1 P_emaildomain R_emaildomain C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 D1 D2 D3 D4 D5 D6 D8 D9 D10 D11 D12 D13 D14 D15 M1 M2 M3 M4 M5 M6 M7 M8 M9 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71 V72 V73 V74 V75 V76 V77 V78 V79 V80 V81 V82 V83 V84 V85 V86 V87 V88 V89 V90 V91 V92 V93 V94 V95 V96 V97 V98 V99 V100 V101 V102 V103 V104 V105 V106 V107 V108 V109 V110 V111 V112 V113 V114 V115 V116 V117 V118 V119 V120 V121 V122 V123 V124 V125 V126 V127 V128 V129 V130 V131 V132 V133 V134 V135 V136 V137 V138 V139 V140 V141 V142 V143 V144 V145 V146 V147 V148 V149 V150 V151 V152 V153 V154 V155 V156 V157 V158 V159 V160 V161 V162 V163 V164 V165 V166 V167 V168 V169 V170 V171 V172 V173 V174 V175 V176 V177 V178 V179 V180 V181 V182 V183 V184 V185 V186 V187 V188 V189 V190 V191 V192 V193 V194 V195 V196 V197 ... V140_missing_flag V141_missing_flag V142_missing_flag V143_missing_flag V144_missing_flag V145_missing_flag V146_missing_flag V147_missing_flag V148_missing_flag V149_missing_flag V150_missing_flag V151_missing_flag V152_missing_flag V153_missing_flag V154_missing_flag V155_missing_flag V156_missing_flag V157_missing_flag V158_missing_flag V159_missing_flag V160_missing_flag V161_missing_flag V162_missing_flag V163_missing_flag V164_missing_flag V165_missing_flag V166_missing_flag V167_missing_flag V168_missing_flag V169_missing_flag V170_missing_flag V171_missing_flag V172_missing_flag V173_missing_flag V174_missing_flag V175_missing_flag V176_missing_flag V177_missing_flag V178_missing_flag V179_missing_flag V180_missing_flag V181_missing_flag V182_missing_flag V183_missing_flag V184_missing_flag V185_missing_flag V186_missing_flag V187_missing_flag V188_missing_flag V189_missing_flag V190_missing_flag V191_missing_flag V192_missing_flag V193_missing_flag V194_missing_flag V195_missing_flag V196_missing_flag V197_missing_flag V198_missing_flag V199_missing_flag V200_missing_flag V201_missing_flag V202_missing_flag V203_missing_flag V204_missing_flag V205_missing_flag V206_missing_flag V207_missing_flag V208_missing_flag V209_missing_flag V210_missing_flag V211_missing_flag V212_missing_flag V213_missing_flag V214_missing_flag V215_missing_flag V216_missing_flag V217_missing_flag V218_missing_flag V219_missing_flag V220_missing_flag V221_missing_flag V222_missing_flag V223_missing_flag V224_missing_flag V225_missing_flag V226_missing_flag V227_missing_flag V228_missing_flag V229_missing_flag V230_missing_flag V231_missing_flag V232_missing_flag V233_missing_flag V234_missing_flag V235_missing_flag V236_missing_flag V237_missing_flag V238_missing_flag V239_missing_flag V240_missing_flag V241_missing_flag V242_missing_flag V243_missing_flag V244_missing_flag V245_missing_flag V246_missing_flag V247_missing_flag V248_missing_flag V249_missing_flag V250_missing_flag V251_missing_flag V252_missing_flag V253_missing_flag V254_missing_flag V255_missing_flag V256_missing_flag V257_missing_flag V258_missing_flag V259_missing_flag V260_missing_flag V261_missing_flag V262_missing_flag V263_missing_flag V264_missing_flag V265_missing_flag V266_missing_flag V267_missing_flag V268_missing_flag V269_missing_flag V270_missing_flag V271_missing_flag V272_missing_flag V273_missing_flag V274_missing_flag V275_missing_flag V276_missing_flag V277_missing_flag V278_missing_flag V279_missing_flag V280_missing_flag V281_missing_flag V282_missing_flag V283_missing_flag V284_missing_flag V285_missing_flag V286_missing_flag V287_missing_flag V288_missing_flag V289_missing_flag V290_missing_flag V291_missing_flag V292_missing_flag V293_missing_flag V294_missing_flag V295_missing_flag V296_missing_flag V297_missing_flag V298_missing_flag V299_missing_flag V300_missing_flag V301_missing_flag V302_missing_flag V303_missing_flag V304_missing_flag V305_missing_flag V306_missing_flag V307_missing_flag V308_missing_flag V309_missing_flag V310_missing_flag V311_missing_flag V312_missing_flag V313_missing_flag V314_missing_flag V315_missing_flag V316_missing_flag V317_missing_flag V318_missing_flag V319_missing_flag V320_missing_flag V321_missing_flag V322_missing_flag V323_missing_flag V324_missing_flag V325_missing_flag V326_missing_flag V327_missing_flag V328_missing_flag V329_missing_flag V330_missing_flag V331_missing_flag V332_missing_flag V333_missing_flag V334_missing_flag V335_missing_flag V336_missing_flag V337_missing_flag V338_missing_flag V339_missing_flag id_01_missing_flag id_02_missing_flag id_03_missing_flag id_04_missing_flag id_05_missing_flag id_06_missing_flag id_07_missing_flag id_08_missing_flag id_09_missing_flag id_10_missing_flag id_11_missing_flag id_12_missing_flag id_13_missing_flag id_14_missing_flag id_15_missing_flag id_16_missing_flag id_17_missing_flag id_18_missing_flag id_19_missing_flag id_20_missing_flag id_21_missing_flag id_22_missing_flag id_23_missing_flag id_24_missing_flag id_25_missing_flag id_26_missing_flag id_27_missing_flag id_28_missing_flag id_29_missing_flag id_30_missing_flag id_31_missing_flag id_32_missing_flag id_33_missing_flag id_34_missing_flag id_35_missing_flag id_36_missing_flag id_37_missing_flag id_38_missing_flag DeviceType_missing_flag DeviceInfo_missing_flag Date _Weekdays _Hours _Days Trans_min_mean Trans_min_std TransactionAmt_to_mean_card1 TransactionAmt_to_mean_card4 TransactionAmt_to_std_card1 TransactionAmt_to_std_card4
0 2987000 0 86400 4.226834 W 13926 NaN 150.0 discover 142.0 credit 315.0 87.0 19.0 NoInf NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 2.0 0.0 1.0 1.0 14.0 NaN 13.0 NaN NaN NaN NaN NaN 13.0 13.0 NaN NaN NaN 0.0 T T T M2 F T NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 117.0 0.0 0.0 0.0 0.0 0.0 117.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True True True True True True True False False True False False False False True False False True True True True True True True False False False False False False False False False False False False False 2017-12-02 00:00:00 5 0 2 -66.527347 -0.278174 0.194638 0.257761 0.184560 0.170241
1 2987001 0 86401 3.367296 W 2755 404.0 150.0 mastercard 102.0 credit 325.0 87.0 NaN Google NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN NaN M0 T T NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True False False True True True True False False False False False False False True False False True True True True True True True False False False False False False False False False False False False False 2017-12-02 00:00:01 5 0 2 -106.027347 -0.443337 0.123780 0.219053 0.063004 0.114214
2 2987002 0 86469 4.077537 W 4663 490.0 150.0 visa 166.0 debit 330.0 87.0 287.0 Microsoft NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN 0.0 315.0 NaN NaN NaN 315.0 T T T M0 F F F F F 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False False False False False True True False False False False False True False False False True False False True True True True True True True False False True False True True True False False False False False False 2017-12-02 00:01:09 5 0 2 -76.027347 -0.317897 0.608151 0.443070 0.589226 0.258550
3 2987003 0 86499 3.912023 W 18132 567.0 150.0 mastercard 117.0 debit 476.0 87.0 NaN Yahoo Mail NoInf 2.0 5.0 0.0 0.0 0.0 4.0 0.0 0.0 1.0 0.0 1.0 0.0 25.0 1.0 112.0 112.0 0.0 94.0 0.0 NaN NaN NaN 84.0 NaN NaN NaN NaN 111.0 NaN NaN NaN M0 T F NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 48.0 28.0 0.0 10.0 4.0 1.0 38.0 24.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 50.0 1758.0 925.0 0.0 354.0 135.0 50.0 1404.0 790.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True False False True True True True False False False True False False False True False False True True True True True True True False False True False True True True False False False False False True 2017-12-02 00:01:39 5 0 2 -85.027347 -0.355529 0.405136 0.377678 0.259460 0.196921
4 2987004 0 86506 3.912023 H 4497 514.0 150.0 mastercard 102.0 credit 420.0 87.0 NaN Google NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 18.0 140.0 0.0 0.0 0.0 0.0 1803.0 49.0 64.0 0.0 0.0 0.0 0.0 0.0 0.0 15560.0 169690.796875 0.0 0.0 0.0 515.0 5155.0 2840.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True False False False False True False False False False False False False True True True True True True True False False False False False False False False False False False False False 2017-12-02 00:01:46 5 0 2 -85.027347 -0.355529 0.515616 0.377678 0.882898 0.196921

5 rows × 846 columns

In [65]:
# Save train df to csv file 
# df.to_csv("Intermediate_Datasets/df_intermediate1.csv",index = False)

# Read train df
df = pd.read_csv("Intermediate_Datasets/df_intermediate1.csv")
In [ ]:
 

8. Dimensionality Reduction - PCA

When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data.

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.

Principal component analysis is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.

In [66]:
# initialize function to perform PCA
def perform_PCA(df, cols, n_components, prefix='PCA_', rand_seed=4):
    pca = PCA(n_components=n_components, random_state=rand_seed)
    principalComponents = pca.fit_transform(df[cols])
    principalDf = pd.DataFrame(principalComponents)
    df.drop(cols, axis=1, inplace=True)

    principalDf.rename(columns=lambda x: str(prefix)+str(x), inplace=True)
    df = pd.concat([df, principalDf], axis=1)
    return df

Create a list of all the columns on which PCA needs to performed

In [67]:
# Columns starting from V1 to V339
filter_col = df.columns[53:392]

Impute missing values in the mas_v columns, later use minmax_scale function to scale the values in these columns

In [ ]:
from   sklearn.preprocessing import minmax_scale
In [68]:
# Fill na values and scale V columns
for col in filter_col:
    df[col] = df[col].fillna((df[col].min() - 2))
    df[col] = (minmax_scale(df[col], feature_range=(0,1)))

# Perform PCA    
df          = perform_PCA(df, filter_col, prefix='PCA_V_', n_components=30)

Reduce memory usage of df as lot of new features have been created

In [69]:
df = reduce_mem_usage(df)
Mem. usage decreased to 1138.99 Mb (21.4% reduction)
In [70]:
df.head()
Out[70]:
TransactionID isFraud TransactionDT TransactionAmt ProductCD card1 card2 card3 card4 card5 card6 addr1 addr2 dist1 P_emaildomain R_emaildomain C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 D1 D2 D3 D4 D5 D6 D8 D9 D10 D11 D12 D13 D14 D15 M1 M2 M3 M4 M5 M6 M7 M8 M9 id_01 id_02 id_03 id_04 id_05 id_06 id_09 id_10 id_11 id_12 id_13 id_14 id_15 id_16 id_17 id_19 id_20 id_28 id_29 id_30 id_31 id_32 id_33 id_34 id_35 id_36 id_37 id_38 DeviceType DeviceInfo card2_missing_flag card3_missing_flag card4_missing_flag card5_missing_flag card6_missing_flag addr1_missing_flag addr2_missing_flag dist1_missing_flag dist2_missing_flag P_emaildomain_missing_flag R_emaildomain_missing_flag D1_missing_flag D2_missing_flag D3_missing_flag D4_missing_flag D5_missing_flag D6_missing_flag D7_missing_flag D8_missing_flag D9_missing_flag D10_missing_flag D11_missing_flag D12_missing_flag D13_missing_flag D14_missing_flag D15_missing_flag M1_missing_flag M2_missing_flag M3_missing_flag M4_missing_flag M5_missing_flag M6_missing_flag M7_missing_flag M8_missing_flag M9_missing_flag V1_missing_flag V2_missing_flag V3_missing_flag V4_missing_flag V5_missing_flag V6_missing_flag V7_missing_flag V8_missing_flag V9_missing_flag V10_missing_flag V11_missing_flag V12_missing_flag V13_missing_flag V14_missing_flag V15_missing_flag V16_missing_flag V17_missing_flag V18_missing_flag V19_missing_flag V20_missing_flag V21_missing_flag V22_missing_flag V23_missing_flag V24_missing_flag V25_missing_flag V26_missing_flag V27_missing_flag V28_missing_flag V29_missing_flag V30_missing_flag V31_missing_flag V32_missing_flag V33_missing_flag V34_missing_flag V35_missing_flag V36_missing_flag V37_missing_flag V38_missing_flag V39_missing_flag V40_missing_flag V41_missing_flag V42_missing_flag V43_missing_flag V44_missing_flag V45_missing_flag V46_missing_flag V47_missing_flag V48_missing_flag V49_missing_flag V50_missing_flag V51_missing_flag V52_missing_flag V53_missing_flag V54_missing_flag V55_missing_flag V56_missing_flag V57_missing_flag V58_missing_flag V59_missing_flag V60_missing_flag V61_missing_flag V62_missing_flag V63_missing_flag V64_missing_flag V65_missing_flag V66_missing_flag V67_missing_flag V68_missing_flag V69_missing_flag V70_missing_flag V71_missing_flag V72_missing_flag V73_missing_flag V74_missing_flag V75_missing_flag V76_missing_flag V77_missing_flag V78_missing_flag V79_missing_flag V80_missing_flag V81_missing_flag V82_missing_flag V83_missing_flag V84_missing_flag V85_missing_flag V86_missing_flag V87_missing_flag V88_missing_flag V89_missing_flag V90_missing_flag V91_missing_flag V92_missing_flag V93_missing_flag V94_missing_flag V95_missing_flag V96_missing_flag V97_missing_flag V98_missing_flag V99_missing_flag V100_missing_flag V101_missing_flag V102_missing_flag V103_missing_flag V104_missing_flag V105_missing_flag V106_missing_flag V107_missing_flag V108_missing_flag V109_missing_flag V110_missing_flag V111_missing_flag V112_missing_flag V113_missing_flag V114_missing_flag V115_missing_flag V116_missing_flag V117_missing_flag V118_missing_flag V119_missing_flag V120_missing_flag V121_missing_flag V122_missing_flag V123_missing_flag V124_missing_flag V125_missing_flag V126_missing_flag V127_missing_flag V128_missing_flag V129_missing_flag V130_missing_flag V131_missing_flag V132_missing_flag ... V170_missing_flag V171_missing_flag V172_missing_flag V173_missing_flag V174_missing_flag V175_missing_flag V176_missing_flag V177_missing_flag V178_missing_flag V179_missing_flag V180_missing_flag V181_missing_flag V182_missing_flag V183_missing_flag V184_missing_flag V185_missing_flag V186_missing_flag V187_missing_flag V188_missing_flag V189_missing_flag V190_missing_flag V191_missing_flag V192_missing_flag V193_missing_flag V194_missing_flag V195_missing_flag V196_missing_flag V197_missing_flag V198_missing_flag V199_missing_flag V200_missing_flag V201_missing_flag V202_missing_flag V203_missing_flag V204_missing_flag V205_missing_flag V206_missing_flag V207_missing_flag V208_missing_flag V209_missing_flag V210_missing_flag V211_missing_flag V212_missing_flag V213_missing_flag V214_missing_flag V215_missing_flag V216_missing_flag V217_missing_flag V218_missing_flag V219_missing_flag V220_missing_flag V221_missing_flag V222_missing_flag V223_missing_flag V224_missing_flag V225_missing_flag V226_missing_flag V227_missing_flag V228_missing_flag V229_missing_flag V230_missing_flag V231_missing_flag V232_missing_flag V233_missing_flag V234_missing_flag V235_missing_flag V236_missing_flag V237_missing_flag V238_missing_flag V239_missing_flag V240_missing_flag V241_missing_flag V242_missing_flag V243_missing_flag V244_missing_flag V245_missing_flag V246_missing_flag V247_missing_flag V248_missing_flag V249_missing_flag V250_missing_flag V251_missing_flag V252_missing_flag V253_missing_flag V254_missing_flag V255_missing_flag V256_missing_flag V257_missing_flag V258_missing_flag V259_missing_flag V260_missing_flag V261_missing_flag V262_missing_flag V263_missing_flag V264_missing_flag V265_missing_flag V266_missing_flag V267_missing_flag V268_missing_flag V269_missing_flag V270_missing_flag V271_missing_flag V272_missing_flag V273_missing_flag V274_missing_flag V275_missing_flag V276_missing_flag V277_missing_flag V278_missing_flag V279_missing_flag V280_missing_flag V281_missing_flag V282_missing_flag V283_missing_flag V284_missing_flag V285_missing_flag V286_missing_flag V287_missing_flag V288_missing_flag V289_missing_flag V290_missing_flag V291_missing_flag V292_missing_flag V293_missing_flag V294_missing_flag V295_missing_flag V296_missing_flag V297_missing_flag V298_missing_flag V299_missing_flag V300_missing_flag V301_missing_flag V302_missing_flag V303_missing_flag V304_missing_flag V305_missing_flag V306_missing_flag V307_missing_flag V308_missing_flag V309_missing_flag V310_missing_flag V311_missing_flag V312_missing_flag V313_missing_flag V314_missing_flag V315_missing_flag V316_missing_flag V317_missing_flag V318_missing_flag V319_missing_flag V320_missing_flag V321_missing_flag V322_missing_flag V323_missing_flag V324_missing_flag V325_missing_flag V326_missing_flag V327_missing_flag V328_missing_flag V329_missing_flag V330_missing_flag V331_missing_flag V332_missing_flag V333_missing_flag V334_missing_flag V335_missing_flag V336_missing_flag V337_missing_flag V338_missing_flag V339_missing_flag id_01_missing_flag id_02_missing_flag id_03_missing_flag id_04_missing_flag id_05_missing_flag id_06_missing_flag id_07_missing_flag id_08_missing_flag id_09_missing_flag id_10_missing_flag id_11_missing_flag id_12_missing_flag id_13_missing_flag id_14_missing_flag id_15_missing_flag id_16_missing_flag id_17_missing_flag id_18_missing_flag id_19_missing_flag id_20_missing_flag id_21_missing_flag id_22_missing_flag id_23_missing_flag id_24_missing_flag id_25_missing_flag id_26_missing_flag id_27_missing_flag id_28_missing_flag id_29_missing_flag id_30_missing_flag id_31_missing_flag id_32_missing_flag id_33_missing_flag id_34_missing_flag id_35_missing_flag id_36_missing_flag id_37_missing_flag id_38_missing_flag DeviceType_missing_flag DeviceInfo_missing_flag Date _Weekdays _Hours _Days Trans_min_mean Trans_min_std TransactionAmt_to_mean_card1 TransactionAmt_to_mean_card4 TransactionAmt_to_std_card1 TransactionAmt_to_std_card4 PCA_V_0 PCA_V_1 PCA_V_2 PCA_V_3 PCA_V_4 PCA_V_5 PCA_V_6 PCA_V_7 PCA_V_8 PCA_V_9 PCA_V_10 PCA_V_11 PCA_V_12 PCA_V_13 PCA_V_14 PCA_V_15 PCA_V_16 PCA_V_17 PCA_V_18 PCA_V_19 PCA_V_20 PCA_V_21 PCA_V_22 PCA_V_23 PCA_V_24 PCA_V_25 PCA_V_26 PCA_V_27 PCA_V_28 PCA_V_29
0 2987000 0 86400 4.226562 W 13926 NaN 150.0 discover 142.0 credit 315.0 87.0 19.0 NoInf NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 2.0 0.0 1.0 1.0 14.0 NaN 13.0 NaN NaN NaN NaN NaN 13.0 13.0 NaN NaN NaN 0.0 T T T M2 F T NaN NaN NaN 0.0 70787.0 NaN NaN NaN NaN NaN NaN 100.0 NotFound NaN -480.0 New NotFound 166.0 542.0 144.0 New NotFound Android Samsung 32.0 2220x1080 match_status:2 T F T T mobile SAMSUNG SM-G892A Build/NRD90M True False False False False False False False True True True False True False True True True True True True False False True True True False False False False False False False True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False ... True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True True True True True True True False False True False False False False True False False True True True True True True True False False False False False False False False False False False False False 2017-12-02 00:00:00 5 0 2 -66.5 -0.278076 0.194580 0.257812 0.184560 0.170288 -0.157349 0.919434 -0.843750 0.308105 -0.089417 0.003044 -0.020050 -0.187622 0.038208 0.002604 -0.010536 0.034058 -0.044434 -0.089722 0.044769 0.001550 -0.003441 0.018616 -0.018387 0.010078 -0.026947 -0.021362 -0.054626 0.025375 0.018814 -0.006039 0.004055 -0.043335 0.008026 -0.007957
1 2987001 0 86401 3.367188 W 2755 404.0 150.0 mastercard 102.0 credit 325.0 87.0 NaN Google NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN NaN M0 T T NaN NaN NaN -5.0 98945.0 NaN NaN 0.0 -5.0 NaN NaN 100.0 NotFound 49.0 -300.0 New NotFound 166.0 621.0 500.0 New NotFound iOS Safari 32.0 1334x750 match_status:1 T F F T mobile iOS Device False False False False False False False True True False True False True True False True True True True True False True True True True False True True True False False False True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False ... True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True False False True True True True False False False False False False False True False False True True True True True True True False False False False False False False False False False False False False 2017-12-02 00:00:01 5 0 2 -106.0 -0.443359 0.123779 0.218994 0.063004 0.114197 -0.086365 -0.800293 -0.152344 -0.363525 -0.101868 -0.002291 0.032318 -0.068848 0.040222 -0.180176 -0.059387 0.002302 0.018982 -0.029556 0.016647 -0.006241 -0.004208 0.010170 -0.001647 -0.022919 0.006298 -0.021164 0.054626 -0.042542 -0.026794 0.003531 0.001647 0.001576 -0.003611 -0.003761
2 2987002 0 86469 4.078125 W 4663 490.0 150.0 visa 166.0 debit 330.0 87.0 287.0 Microsoft NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN 0.0 315.0 NaN NaN NaN 315.0 T T T M0 F F F F F -5.0 191631.0 0.0 0.0 0.0 0.0 0.0 0.0 100.0 NotFound 52.0 NaN Found Found 121.0 410.0 142.0 Found Found NAN Chrome NaN NaN NaN F F T T desktop Windows False False False False False False False False True False True False True True False True True True True True False False True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False ... True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False False False False False True True False False False False False True False False False True False False True True True True True True True False False True False True True True False False False False False False 2017-12-02 00:01:09 5 0 2 -76.0 -0.317871 0.607910 0.443115 0.589226 0.258545 -0.800781 0.316895 0.273193 -0.026352 0.043182 -0.008064 -0.039276 -0.217041 0.017715 0.033508 -0.000322 -0.015343 0.020676 -0.046051 -0.006725 0.004875 0.001104 0.007484 -0.007793 -0.006611 -0.008270 0.009857 -0.007710 0.003191 0.002834 0.001886 0.003839 0.002903 -0.019592 -0.003424
3 2987003 0 86499 3.912109 W 18132 567.0 150.0 mastercard 117.0 debit 476.0 87.0 NaN Yahoo Mail NoInf 2.0 5.0 0.0 0.0 0.0 4.0 0.0 0.0 1.0 0.0 1.0 0.0 25.0 1.0 112.0 112.0 0.0 94.0 0.0 NaN NaN NaN 84.0 NaN NaN NaN NaN 111.0 NaN NaN NaN M0 T F NaN NaN NaN -5.0 221832.0 NaN NaN 0.0 -6.0 NaN NaN 100.0 NotFound 52.0 NaN New NotFound 225.0 176.0 507.0 New NotFound NAN Chrome NaN NaN NaN F F T T desktop NaN False False False False False False False True True False True False False False False False True True True True False True True True True False True True True False False False True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False ... True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True False False True True True True False False False True False False False True False False True True True True True True True False False True False True True True False False False False False True 2017-12-02 00:01:39 5 0 2 -85.0 -0.355469 0.405029 0.377686 0.259460 0.196899 -0.237427 -0.811523 -0.123657 -0.423828 -0.067261 0.025040 0.110413 -0.253906 0.004803 0.170410 -0.012550 -0.014488 0.005268 0.031891 -0.013489 -0.017319 -0.001888 -0.084717 0.050293 0.140747 0.058960 -0.020218 0.066589 -0.010910 -0.017868 0.025528 0.003674 0.003511 0.026047 -0.041962
4 2987004 0 86506 3.912109 H 4497 514.0 150.0 mastercard 102.0 credit 420.0 87.0 NaN Google NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 7460.0 0.0 0.0 1.0 0.0 0.0 0.0 100.0 NotFound NaN -300.0 Found Found 166.0 529.0 575.0 Found Found Mac Chrome 24.0 1280x800 match_status:2 T F T T desktop MacOS False False False False False False False True True False True False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False ... False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True False False False False True False False False False False False False True True True True True True True False False False False False False False False False False False False False 2017-12-02 00:01:46 5 0 2 -85.0 -0.355469 0.515625 0.377686 0.882898 0.196899 2.904297 0.380127 0.480713 -0.009659 -0.171753 1.170898 -0.178223 0.004517 0.043793 -0.001614 0.015610 -0.017883 -0.020523 -0.005604 -0.010262 -0.003822 -0.011459 -0.008179 0.013390 0.017792 -0.013702 0.000093 -0.005245 -0.142334 0.202271 0.014458 0.012764 0.002150 0.014008 -0.001770

5 rows × 537 columns

In [71]:
# Plot first 2 PCA features and colour by target variable
plt.figure(figsize=(12, 8));
groups = df.groupby("isFraud")
for name, group in groups:
    plt.scatter(group["PCA_V_0"], group["PCA_V_1"], label=name)
plt.legend()
plt.show()

9. Feature Encoding

Encoding is the process of converting data from one form to another. Most of the Machine learning algorithms can not handle categorical values unless we convert them to numerical values. Many algorithm’s performances vary based on how Categorical columns are encoded.

  • Frequency Encoding - It is a way to utilize the frequency of the categories as labels. In the cases where the frequency is related somewhat with the target variable, it helps the model to understand and assign the weight in direct and inverse proportion, depending on nature of the data.

Create a list of variables that needs to be encoded using frequency encoding. Let's note down the features which has more than 30 unique values, We would using frequency encoding for these features only

In [72]:
cat_columns = df.select_dtypes(include=['object']).columns
len(cat_columns)
Out[72]:
30
In [73]:
binary_columns = [col for col in df.columns if df[col].nunique() == 2]
len(binary_columns)
Out[73]:
432
In [74]:
num_columns = [col for col in df.columns if (col not in cat_columns) & (col not in binary_columns)]
len(num_columns)
Out[74]:
92
In [75]:
cat_columns = cat_columns.to_list() + binary_columns
In [76]:
# Frequecny encoding variables
frequency_encoded_variables = []
for col in cat_columns:
    if df[col].nunique() > 30:
        print(col, df[col].nunique())
        frequency_encoded_variables.append(col)
id_33 260
DeviceInfo 1786
Date 573349

It's time to encode the variables using frequency encoding

In [77]:
# Frequecny enocde the variables
for variable in tqdm(frequency_encoded_variables):
    # group by frequency 
    fq = df.groupby(variable).size()/len(df)    
    # mapping values to dataframe 
    df.loc[:, "{}".format(variable)] = df[variable].map(fq)   
    cat_columns.remove(variable)
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.18it/s]
In [78]:
df.head()
Out[78]:
TransactionID isFraud TransactionDT TransactionAmt ProductCD card1 card2 card3 card4 card5 card6 addr1 addr2 dist1 P_emaildomain R_emaildomain C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 D1 D2 D3 D4 D5 D6 D8 D9 D10 D11 D12 D13 D14 D15 M1 M2 M3 M4 M5 M6 M7 M8 M9 id_01 id_02 id_03 id_04 id_05 id_06 id_09 id_10 id_11 id_12 id_13 id_14 id_15 id_16 id_17 id_19 id_20 id_28 id_29 id_30 id_31 id_32 id_33 id_34 id_35 id_36 id_37 id_38 DeviceType DeviceInfo card2_missing_flag card3_missing_flag card4_missing_flag card5_missing_flag card6_missing_flag addr1_missing_flag addr2_missing_flag dist1_missing_flag dist2_missing_flag P_emaildomain_missing_flag R_emaildomain_missing_flag D1_missing_flag D2_missing_flag D3_missing_flag D4_missing_flag D5_missing_flag D6_missing_flag D7_missing_flag D8_missing_flag D9_missing_flag D10_missing_flag D11_missing_flag D12_missing_flag D13_missing_flag D14_missing_flag D15_missing_flag M1_missing_flag M2_missing_flag M3_missing_flag M4_missing_flag M5_missing_flag M6_missing_flag M7_missing_flag M8_missing_flag M9_missing_flag V1_missing_flag V2_missing_flag V3_missing_flag V4_missing_flag V5_missing_flag V6_missing_flag V7_missing_flag V8_missing_flag V9_missing_flag V10_missing_flag V11_missing_flag V12_missing_flag V13_missing_flag V14_missing_flag V15_missing_flag V16_missing_flag V17_missing_flag V18_missing_flag V19_missing_flag V20_missing_flag V21_missing_flag V22_missing_flag V23_missing_flag V24_missing_flag V25_missing_flag V26_missing_flag V27_missing_flag V28_missing_flag V29_missing_flag V30_missing_flag V31_missing_flag V32_missing_flag V33_missing_flag V34_missing_flag V35_missing_flag V36_missing_flag V37_missing_flag V38_missing_flag V39_missing_flag V40_missing_flag V41_missing_flag V42_missing_flag V43_missing_flag V44_missing_flag V45_missing_flag V46_missing_flag V47_missing_flag V48_missing_flag V49_missing_flag V50_missing_flag V51_missing_flag V52_missing_flag V53_missing_flag V54_missing_flag V55_missing_flag V56_missing_flag V57_missing_flag V58_missing_flag V59_missing_flag V60_missing_flag V61_missing_flag V62_missing_flag V63_missing_flag V64_missing_flag V65_missing_flag V66_missing_flag V67_missing_flag V68_missing_flag V69_missing_flag V70_missing_flag V71_missing_flag V72_missing_flag V73_missing_flag V74_missing_flag V75_missing_flag V76_missing_flag V77_missing_flag V78_missing_flag V79_missing_flag V80_missing_flag V81_missing_flag V82_missing_flag V83_missing_flag V84_missing_flag V85_missing_flag V86_missing_flag V87_missing_flag V88_missing_flag V89_missing_flag V90_missing_flag V91_missing_flag V92_missing_flag V93_missing_flag V94_missing_flag V95_missing_flag V96_missing_flag V97_missing_flag V98_missing_flag V99_missing_flag V100_missing_flag V101_missing_flag V102_missing_flag V103_missing_flag V104_missing_flag V105_missing_flag V106_missing_flag V107_missing_flag V108_missing_flag V109_missing_flag V110_missing_flag V111_missing_flag V112_missing_flag V113_missing_flag V114_missing_flag V115_missing_flag V116_missing_flag V117_missing_flag V118_missing_flag V119_missing_flag V120_missing_flag V121_missing_flag V122_missing_flag V123_missing_flag V124_missing_flag V125_missing_flag V126_missing_flag V127_missing_flag V128_missing_flag V129_missing_flag V130_missing_flag V131_missing_flag V132_missing_flag ... V170_missing_flag V171_missing_flag V172_missing_flag V173_missing_flag V174_missing_flag V175_missing_flag V176_missing_flag V177_missing_flag V178_missing_flag V179_missing_flag V180_missing_flag V181_missing_flag V182_missing_flag V183_missing_flag V184_missing_flag V185_missing_flag V186_missing_flag V187_missing_flag V188_missing_flag V189_missing_flag V190_missing_flag V191_missing_flag V192_missing_flag V193_missing_flag V194_missing_flag V195_missing_flag V196_missing_flag V197_missing_flag V198_missing_flag V199_missing_flag V200_missing_flag V201_missing_flag V202_missing_flag V203_missing_flag V204_missing_flag V205_missing_flag V206_missing_flag V207_missing_flag V208_missing_flag V209_missing_flag V210_missing_flag V211_missing_flag V212_missing_flag V213_missing_flag V214_missing_flag V215_missing_flag V216_missing_flag V217_missing_flag V218_missing_flag V219_missing_flag V220_missing_flag V221_missing_flag V222_missing_flag V223_missing_flag V224_missing_flag V225_missing_flag V226_missing_flag V227_missing_flag V228_missing_flag V229_missing_flag V230_missing_flag V231_missing_flag V232_missing_flag V233_missing_flag V234_missing_flag V235_missing_flag V236_missing_flag V237_missing_flag V238_missing_flag V239_missing_flag V240_missing_flag V241_missing_flag V242_missing_flag V243_missing_flag V244_missing_flag V245_missing_flag V246_missing_flag V247_missing_flag V248_missing_flag V249_missing_flag V250_missing_flag V251_missing_flag V252_missing_flag V253_missing_flag V254_missing_flag V255_missing_flag V256_missing_flag V257_missing_flag V258_missing_flag V259_missing_flag V260_missing_flag V261_missing_flag V262_missing_flag V263_missing_flag V264_missing_flag V265_missing_flag V266_missing_flag V267_missing_flag V268_missing_flag V269_missing_flag V270_missing_flag V271_missing_flag V272_missing_flag V273_missing_flag V274_missing_flag V275_missing_flag V276_missing_flag V277_missing_flag V278_missing_flag V279_missing_flag V280_missing_flag V281_missing_flag V282_missing_flag V283_missing_flag V284_missing_flag V285_missing_flag V286_missing_flag V287_missing_flag V288_missing_flag V289_missing_flag V290_missing_flag V291_missing_flag V292_missing_flag V293_missing_flag V294_missing_flag V295_missing_flag V296_missing_flag V297_missing_flag V298_missing_flag V299_missing_flag V300_missing_flag V301_missing_flag V302_missing_flag V303_missing_flag V304_missing_flag V305_missing_flag V306_missing_flag V307_missing_flag V308_missing_flag V309_missing_flag V310_missing_flag V311_missing_flag V312_missing_flag V313_missing_flag V314_missing_flag V315_missing_flag V316_missing_flag V317_missing_flag V318_missing_flag V319_missing_flag V320_missing_flag V321_missing_flag V322_missing_flag V323_missing_flag V324_missing_flag V325_missing_flag V326_missing_flag V327_missing_flag V328_missing_flag V329_missing_flag V330_missing_flag V331_missing_flag V332_missing_flag V333_missing_flag V334_missing_flag V335_missing_flag V336_missing_flag V337_missing_flag V338_missing_flag V339_missing_flag id_01_missing_flag id_02_missing_flag id_03_missing_flag id_04_missing_flag id_05_missing_flag id_06_missing_flag id_07_missing_flag id_08_missing_flag id_09_missing_flag id_10_missing_flag id_11_missing_flag id_12_missing_flag id_13_missing_flag id_14_missing_flag id_15_missing_flag id_16_missing_flag id_17_missing_flag id_18_missing_flag id_19_missing_flag id_20_missing_flag id_21_missing_flag id_22_missing_flag id_23_missing_flag id_24_missing_flag id_25_missing_flag id_26_missing_flag id_27_missing_flag id_28_missing_flag id_29_missing_flag id_30_missing_flag id_31_missing_flag id_32_missing_flag id_33_missing_flag id_34_missing_flag id_35_missing_flag id_36_missing_flag id_37_missing_flag id_38_missing_flag DeviceType_missing_flag DeviceInfo_missing_flag Date _Weekdays _Hours _Days Trans_min_mean Trans_min_std TransactionAmt_to_mean_card1 TransactionAmt_to_mean_card4 TransactionAmt_to_std_card1 TransactionAmt_to_std_card4 PCA_V_0 PCA_V_1 PCA_V_2 PCA_V_3 PCA_V_4 PCA_V_5 PCA_V_6 PCA_V_7 PCA_V_8 PCA_V_9 PCA_V_10 PCA_V_11 PCA_V_12 PCA_V_13 PCA_V_14 PCA_V_15 PCA_V_16 PCA_V_17 PCA_V_18 PCA_V_19 PCA_V_20 PCA_V_21 PCA_V_22 PCA_V_23 PCA_V_24 PCA_V_25 PCA_V_26 PCA_V_27 PCA_V_28 PCA_V_29
0 2987000 0 86400 4.226562 W 13926 NaN 150.0 discover 142.0 credit 315.0 87.0 19.0 NoInf NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 2.0 0.0 1.0 1.0 14.0 NaN 13.0 NaN NaN NaN NaN NaN 13.0 13.0 NaN NaN NaN 0.0 T T T M2 F T NaN NaN NaN 0.0 70787.0 NaN NaN NaN NaN NaN NaN 100.0 NotFound NaN -480.0 New NotFound 166.0 542.0 144.0 New NotFound Android Samsung 32.0 0.000921 match_status:2 T F T T mobile 0.000015 True False False False False False False False True True True False True False True True True True True True False False True True True False False False False False False False True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False ... True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True True True True True True True False False True False False False False True False False True True True True True True True False False False False False False False False False False False False False 0.000002 5 0 2 -66.5 -0.278076 0.194580 0.257812 0.184560 0.170288 -0.157349 0.919434 -0.843750 0.308105 -0.089417 0.003044 -0.020050 -0.187622 0.038208 0.002604 -0.010536 0.034058 -0.044434 -0.089722 0.044769 0.001550 -0.003441 0.018616 -0.018387 0.010078 -0.026947 -0.021362 -0.054626 0.025375 0.018814 -0.006039 0.004055 -0.043335 0.008026 -0.007957
1 2987001 0 86401 3.367188 W 2755 404.0 150.0 mastercard 102.0 credit 325.0 87.0 NaN Google NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN NaN NaN M0 T T NaN NaN NaN -5.0 98945.0 NaN NaN 0.0 -5.0 NaN NaN 100.0 NotFound 49.0 -300.0 New NotFound 166.0 621.0 500.0 New NotFound iOS Safari 32.0 0.010917 match_status:1 T F F T mobile 0.033498 False False False False False False False True True False True False True True False True True True True True False True True True True False True True True False False False True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False ... True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True False False True True True True False False False False False False False True False False True True True True True True True False False False False False False False False False False False False False 0.000002 5 0 2 -106.0 -0.443359 0.123779 0.218994 0.063004 0.114197 -0.086365 -0.800293 -0.152344 -0.363525 -0.101868 -0.002291 0.032318 -0.068848 0.040222 -0.180176 -0.059387 0.002302 0.018982 -0.029556 0.016647 -0.006241 -0.004208 0.010170 -0.001647 -0.022919 0.006298 -0.021164 0.054626 -0.042542 -0.026794 0.003531 0.001647 0.001576 -0.003611 -0.003761
2 2987002 0 86469 4.078125 W 4663 490.0 150.0 visa 166.0 debit 330.0 87.0 287.0 Microsoft NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN 0.0 315.0 NaN NaN NaN 315.0 T T T M0 F F F F F -5.0 191631.0 0.0 0.0 0.0 0.0 0.0 0.0 100.0 NotFound 52.0 NaN Found Found 121.0 410.0 142.0 Found Found NAN Chrome NaN NaN NaN F F T T desktop 0.080811 False False False False False False False False True False True False True True False True True True True True False False True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False ... True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False False False False False True True False False False False False True False False False True False False True True True True True True True False False True False True True True False False False False False False 0.000002 5 0 2 -76.0 -0.317871 0.607910 0.443115 0.589226 0.258545 -0.800781 0.316895 0.273193 -0.026352 0.043182 -0.008064 -0.039276 -0.217041 0.017715 0.033508 -0.000322 -0.015343 0.020676 -0.046051 -0.006725 0.004875 0.001104 0.007484 -0.007793 -0.006611 -0.008270 0.009857 -0.007710 0.003191 0.002834 0.001886 0.003839 0.002903 -0.019592 -0.003424
3 2987003 0 86499 3.912109 W 18132 567.0 150.0 mastercard 117.0 debit 476.0 87.0 NaN Yahoo Mail NoInf 2.0 5.0 0.0 0.0 0.0 4.0 0.0 0.0 1.0 0.0 1.0 0.0 25.0 1.0 112.0 112.0 0.0 94.0 0.0 NaN NaN NaN 84.0 NaN NaN NaN NaN 111.0 NaN NaN NaN M0 T F NaN NaN NaN -5.0 221832.0 NaN NaN 0.0 -6.0 NaN NaN 100.0 NotFound 52.0 NaN New NotFound 225.0 176.0 507.0 New NotFound NAN Chrome NaN NaN NaN F F T T desktop NaN False False False False False False False True True False True False False False False False True True True True False True True True True False True True True False False False True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False ... True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True True True True True True True True True True True True True True True True True False False True True False False True True True True False False False True False False False True False False True True True True True True True False False True False True True True False False False False False True 0.000002 5 0 2 -85.0 -0.355469 0.405029 0.377686 0.259460 0.196899 -0.237427 -0.811523 -0.123657 -0.423828 -0.067261 0.025040 0.110413 -0.253906 0.004803 0.170410 -0.012550 -0.014488 0.005268 0.031891 -0.013489 -0.017319 -0.001888 -0.084717 0.050293 0.140747 0.058960 -0.020218 0.066589 -0.010910 -0.017868 0.025528 0.003674 0.003511 0.026047 -0.041962
4 2987004 0 86506 3.912109 H 4497 514.0 150.0 mastercard 102.0 credit 420.0 87.0 NaN Google NoInf 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 7460.0 0.0 0.0 1.0 0.0 0.0 0.0 100.0 NotFound NaN -300.0 Found Found 166.0 529.0 575.0 Found Found Mac Chrome 24.0 0.003639 match_status:2 T F T T desktop 0.021291 False False False False False False False True True False True False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False ... False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True True False False False False True False False False False False False False True True True True True True True False False False False False False False False False False False False False 0.000002 5 0 2 -85.0 -0.355469 0.515625 0.377686 0.882898 0.196899 2.904297 0.380127 0.480713 -0.009659 -0.171753 1.170898 -0.178223 0.004517 0.043793 -0.001614 0.015610 -0.017883 -0.020523 -0.005604 -0.010262 -0.003822 -0.011459 -0.008179 0.013390 0.017792 -0.013702 0.000093 -0.005245 -0.142334 0.202271 0.014458 0.012764 0.002150 0.014008 -0.001770

5 rows × 537 columns

  • Label encoding - Label Encoding refers to converting the labels into numeric form so as to convert it into the machine-readable form. Machine learning algorithms can then decide in a better way on how those labels must be operated. It is an important pre-processing step for the structured dataset in supervised learning.

It is a popular encoding technique for handling categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering.

In [79]:
# Label encode the variables
for col in cat_columns:
    lbl        = LabelEncoder()
    lbl.fit(list(df[col].values))
    df[col] = lbl.transform(list(df[col].values))

Let's reduce the memory usage as lot of new columns has been added to the data frame

In [80]:
# Reduce memory usage
df = reduce_mem_usage(df)
Mem. usage decreased to 361.00 Mb (82.7% reduction)

Tip : Save the train df, and clean all memory

In [ ]:
# Save train df to csv file 
df.to_csv("Intermediate_Datasets/df_intermediate2.csv", index = False)

10. Data Preprocessing for Model Building

The goal of this section is to:

  • Clean up columns
  • Create X and y
  • Split the dataset in training and test sets
In [2]:
# Read train df
df = pd.read_csv("Intermediate_Datasets/df_intermediate2.csv")
# df = df.sample(10000, random_state=0)
In [3]:
df.loc[:, 'isFraud'].value_counts()
Out[3]:
0    569877
1     20663
Name: isFraud, dtype: int64

Drop the columns which may not be useful for model building

In [4]:
df = df.drop(['TransactionID','TransactionDT','Date'], axis=1)

Separate the x variables and y variables

In [5]:
# Split the y variable series and x variables dataset
X = df.drop(['isFraud'],axis=1)
y = df.isFraud.astype(bool)

# Delete train df
del df

# Collect garbage
gc.collect()
Out[5]:
0

Split the dataset into train set and test set. Train set will be used to train the model. Test set will be used to check the performance of model

In [6]:
# Split the dataset into the training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
In [7]:
# Head of X_train
X_train.head()
Out[7]:
TransactionAmt ProductCD card1 card2 card3 card4 card5 card6 addr1 addr2 dist1 P_emaildomain R_emaildomain C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 D1 D2 D3 D4 D5 D6 D8 D9 D10 D11 D12 D13 D14 D15 M1 M2 M3 M4 M5 M6 M7 M8 M9 id_01 id_02 id_03 id_04 id_05 id_06 id_09 id_10 id_11 id_12 id_13 id_14 id_15 id_16 id_17 id_19 id_20 id_28 id_29 id_30 id_31 id_32 id_33 id_34 id_35 id_36 id_37 id_38 DeviceType DeviceInfo card2_missing_flag card3_missing_flag card4_missing_flag card5_missing_flag card6_missing_flag addr1_missing_flag addr2_missing_flag dist1_missing_flag dist2_missing_flag P_emaildomain_missing_flag R_emaildomain_missing_flag D1_missing_flag D2_missing_flag D3_missing_flag D4_missing_flag D5_missing_flag D6_missing_flag D7_missing_flag D8_missing_flag D9_missing_flag D10_missing_flag D11_missing_flag D12_missing_flag D13_missing_flag D14_missing_flag D15_missing_flag M1_missing_flag M2_missing_flag M3_missing_flag M4_missing_flag M5_missing_flag M6_missing_flag M7_missing_flag M8_missing_flag M9_missing_flag V1_missing_flag V2_missing_flag V3_missing_flag V4_missing_flag V5_missing_flag V6_missing_flag V7_missing_flag V8_missing_flag V9_missing_flag V10_missing_flag V11_missing_flag V12_missing_flag V13_missing_flag V14_missing_flag V15_missing_flag V16_missing_flag V17_missing_flag V18_missing_flag V19_missing_flag V20_missing_flag V21_missing_flag V22_missing_flag V23_missing_flag V24_missing_flag V25_missing_flag V26_missing_flag V27_missing_flag V28_missing_flag V29_missing_flag V30_missing_flag V31_missing_flag V32_missing_flag V33_missing_flag V34_missing_flag V35_missing_flag V36_missing_flag V37_missing_flag V38_missing_flag V39_missing_flag V40_missing_flag V41_missing_flag V42_missing_flag V43_missing_flag V44_missing_flag V45_missing_flag V46_missing_flag V47_missing_flag V48_missing_flag V49_missing_flag V50_missing_flag V51_missing_flag V52_missing_flag V53_missing_flag V54_missing_flag V55_missing_flag V56_missing_flag V57_missing_flag V58_missing_flag V59_missing_flag V60_missing_flag V61_missing_flag V62_missing_flag V63_missing_flag V64_missing_flag V65_missing_flag V66_missing_flag V67_missing_flag V68_missing_flag V69_missing_flag V70_missing_flag V71_missing_flag V72_missing_flag V73_missing_flag V74_missing_flag V75_missing_flag V76_missing_flag V77_missing_flag V78_missing_flag V79_missing_flag V80_missing_flag V81_missing_flag V82_missing_flag V83_missing_flag V84_missing_flag V85_missing_flag V86_missing_flag V87_missing_flag V88_missing_flag V89_missing_flag V90_missing_flag V91_missing_flag V92_missing_flag V93_missing_flag V94_missing_flag V95_missing_flag V96_missing_flag V97_missing_flag V98_missing_flag V99_missing_flag V100_missing_flag V101_missing_flag V102_missing_flag V103_missing_flag V104_missing_flag V105_missing_flag V106_missing_flag V107_missing_flag V108_missing_flag V109_missing_flag V110_missing_flag V111_missing_flag V112_missing_flag V113_missing_flag V114_missing_flag V115_missing_flag V116_missing_flag V117_missing_flag V118_missing_flag V119_missing_flag V120_missing_flag V121_missing_flag V122_missing_flag V123_missing_flag V124_missing_flag V125_missing_flag V126_missing_flag V127_missing_flag V128_missing_flag V129_missing_flag V130_missing_flag V131_missing_flag V132_missing_flag V133_missing_flag V134_missing_flag V135_missing_flag ... V169_missing_flag V170_missing_flag V171_missing_flag V172_missing_flag V173_missing_flag V174_missing_flag V175_missing_flag V176_missing_flag V177_missing_flag V178_missing_flag V179_missing_flag V180_missing_flag V181_missing_flag V182_missing_flag V183_missing_flag V184_missing_flag V185_missing_flag V186_missing_flag V187_missing_flag V188_missing_flag V189_missing_flag V190_missing_flag V191_missing_flag V192_missing_flag V193_missing_flag V194_missing_flag V195_missing_flag V196_missing_flag V197_missing_flag V198_missing_flag V199_missing_flag V200_missing_flag V201_missing_flag V202_missing_flag V203_missing_flag V204_missing_flag V205_missing_flag V206_missing_flag V207_missing_flag V208_missing_flag V209_missing_flag V210_missing_flag V211_missing_flag V212_missing_flag V213_missing_flag V214_missing_flag V215_missing_flag V216_missing_flag V217_missing_flag V218_missing_flag V219_missing_flag V220_missing_flag V221_missing_flag V222_missing_flag V223_missing_flag V224_missing_flag V225_missing_flag V226_missing_flag V227_missing_flag V228_missing_flag V229_missing_flag V230_missing_flag V231_missing_flag V232_missing_flag V233_missing_flag V234_missing_flag V235_missing_flag V236_missing_flag V237_missing_flag V238_missing_flag V239_missing_flag V240_missing_flag V241_missing_flag V242_missing_flag V243_missing_flag V244_missing_flag V245_missing_flag V246_missing_flag V247_missing_flag V248_missing_flag V249_missing_flag V250_missing_flag V251_missing_flag V252_missing_flag V253_missing_flag V254_missing_flag V255_missing_flag V256_missing_flag V257_missing_flag V258_missing_flag V259_missing_flag V260_missing_flag V261_missing_flag V262_missing_flag V263_missing_flag V264_missing_flag V265_missing_flag V266_missing_flag V267_missing_flag V268_missing_flag V269_missing_flag V270_missing_flag V271_missing_flag V272_missing_flag V273_missing_flag V274_missing_flag V275_missing_flag V276_missing_flag V277_missing_flag V278_missing_flag V279_missing_flag V280_missing_flag V281_missing_flag V282_missing_flag V283_missing_flag V284_missing_flag V285_missing_flag V286_missing_flag V287_missing_flag V288_missing_flag V289_missing_flag V290_missing_flag V291_missing_flag V292_missing_flag V293_missing_flag V294_missing_flag V295_missing_flag V296_missing_flag V297_missing_flag V298_missing_flag V299_missing_flag V300_missing_flag V301_missing_flag V302_missing_flag V303_missing_flag V304_missing_flag V305_missing_flag V306_missing_flag V307_missing_flag V308_missing_flag V309_missing_flag V310_missing_flag V311_missing_flag V312_missing_flag V313_missing_flag V314_missing_flag V315_missing_flag V316_missing_flag V317_missing_flag V318_missing_flag V319_missing_flag V320_missing_flag V321_missing_flag V322_missing_flag V323_missing_flag V324_missing_flag V325_missing_flag V326_missing_flag V327_missing_flag V328_missing_flag V329_missing_flag V330_missing_flag V331_missing_flag V332_missing_flag V333_missing_flag V334_missing_flag V335_missing_flag V336_missing_flag V337_missing_flag V338_missing_flag V339_missing_flag id_01_missing_flag id_02_missing_flag id_03_missing_flag id_04_missing_flag id_05_missing_flag id_06_missing_flag id_07_missing_flag id_08_missing_flag id_09_missing_flag id_10_missing_flag id_11_missing_flag id_12_missing_flag id_13_missing_flag id_14_missing_flag id_15_missing_flag id_16_missing_flag id_17_missing_flag id_18_missing_flag id_19_missing_flag id_20_missing_flag id_21_missing_flag id_22_missing_flag id_23_missing_flag id_24_missing_flag id_25_missing_flag id_26_missing_flag id_27_missing_flag id_28_missing_flag id_29_missing_flag id_30_missing_flag id_31_missing_flag id_32_missing_flag id_33_missing_flag id_34_missing_flag id_35_missing_flag id_36_missing_flag id_37_missing_flag id_38_missing_flag DeviceType_missing_flag DeviceInfo_missing_flag _Weekdays _Hours _Days Trans_min_mean Trans_min_std TransactionAmt_to_mean_card1 TransactionAmt_to_mean_card4 TransactionAmt_to_std_card1 TransactionAmt_to_std_card4 PCA_V_0 PCA_V_1 PCA_V_2 PCA_V_3 PCA_V_4 PCA_V_5 PCA_V_6 PCA_V_7 PCA_V_8 PCA_V_9 PCA_V_10 PCA_V_11 PCA_V_12 PCA_V_13 PCA_V_14 PCA_V_15 PCA_V_16 PCA_V_17 PCA_V_18 PCA_V_19 PCA_V_20 PCA_V_21 PCA_V_22 PCA_V_23 PCA_V_24 PCA_V_25 PCA_V_26 PCA_V_27 PCA_V_28 PCA_V_29
464452 3.432 4 15063 514.0 150.0 4 226.0 1 315.0 87.0 18.0 0 2 1.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 2.0 1.0 35.0 35.0 34.0 34.0 34.0 NaN NaN NaN 34.0 34.0 NaN NaN NaN 34.0 1 1 1 3 2 1 0 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN 3 2 NaN NaN NaN 2 2 3 4 NaN NaN 4 2 2 2 2 2 NaN 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 4 18 -104.06 -0.4350 0.1501 0.2324 0.092611 0.13560 -0.9330 0.3564 0.2969 -0.05145 0.012596 0.01012 0.02788 0.1576 -0.053300 0.13940 0.033940 0.021680 -0.04034 0.02574 -0.022110 -0.003386 0.019290 0.013780 -0.007637 -0.031980 0.012856 -0.007412 0.000978 -0.016700 -0.007600 -0.027450 0.00541 -0.006160 0.005920 0.008170
36372 3.717 0 6019 583.0 150.0 4 226.0 1 NaN NaN NaN 0 0 928.0 1301.0 0.0 475.0 0.0 475.0 476.0 688.0 0.0 823.0 667.0 667.0 695.0 300.0 0.0 NaN NaN 0.0 NaN 0.0 NaN NaN NaN NaN 0.0 0.0 302.0 0.0 2 2 2 2 2 2 2 2 2 -5.0 49333.0 0.0 0.0 0.0 -1.0 0.0 0.0 100.0 1 52.0 -360.0 0 0 166.0 410.0 611.0 0 0 4 0 24.0 0.02858 3 1 0 1 0 0 0.0808 0 0 0 0 0 1 1 1 0 0 0 0 1 1 0 1 0 1 1 1 1 1 0 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 11 -93.94 -0.3928 0.1823 0.3086 0.103778 0.18000 0.6445 -0.9595 0.6094 0.50200 -0.377700 -0.15760 0.07880 0.1742 1.205000 0.12660 -0.007683 -0.130900 -0.04294 0.00345 0.057530 0.048770 -0.015470 -0.016300 0.073850 0.016130 -0.018260 0.036070 -0.019200 -0.050660 -0.045530 -0.032350 -0.02182 0.038240 -0.009430 0.004864
572387 5.527 4 17188 321.0 150.0 4 226.0 2 299.0 87.0 1.0 6 2 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0.0 1.0 0.0 14.0 1.0 568.0 80.0 1.0 569.0 1.0 NaN NaN NaN 569.0 569.0 NaN NaN NaN 569.0 1 1 1 1 0 1 0 1 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN 3 2 NaN NaN NaN 2 2 3 4 NaN NaN 4 2 2 2 2 2 NaN 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 21 25 116.00 0.4849 1.9370 1.8850 1.348902 1.10000 -0.8013 0.3174 0.2720 -0.02687 0.049070 -0.00634 -0.02972 -0.2195 0.019070 0.03270 -0.009254 -0.009834 0.02269 -0.02753 -0.009140 -0.002127 -0.001934 -0.025130 0.016200 0.040620 0.000128 0.001646 -0.004090 0.010020 0.002650 0.010440 0.00992 0.005665 -0.008720 -0.011665
497276 3.950 4 1214 174.0 150.0 4 226.0 1 204.0 87.0 NaN 4 2 1.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN 0.0 1 1 1 3 2 0 0 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN 3 2 NaN NaN NaN 2 2 3 4 NaN NaN 4 2 2 2 2 2 NaN 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 19 30 -83.06 -0.3474 0.2540 0.3901 0.147945 0.22770 -0.9326 0.3564 0.2974 -0.05154 0.010610 0.00979 0.02590 0.1611 -0.054500 0.13270 0.035740 -0.010574 -0.00642 -0.03049 -0.012240 0.017960 0.018100 0.002857 -0.016740 0.007100 0.018840 -0.006878 0.008800 -0.004406 0.000245 -0.014570 -0.00884 -0.011110 -0.000597 -0.007458
97470 3.219 1 12695 490.0 150.0 4 226.0 2 325.0 87.0 NaN 6 6 1.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 2 2 3 2 2 2 2 2 -5.0 68281.0 NaN NaN 0.0 0.0 NaN NaN 100.0 1 52.0 NaN 1 1 225.0 266.0 507.0 1 1 3 0 NaN NaN 4 0 0 1 0 0 NaN 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 0 0 1 0 0 1 1 1 1 1 1 1 0 0 1 0 1 1 1 0 0 0 0 0 1 5 21 23 -110.00 -0.4600 0.1771 0.1877 0.116133 0.10956 2.8070 0.3457 0.3772 -0.01767 -0.042420 -0.24430 0.02310 0.0118 0.000914 0.02304 -0.009550 -0.022300 -0.02167 -0.01366 -0.005627 -0.009690 0.001454 -0.008240 -0.008900 0.005936 -0.004475 -0.000716 -0.003060 0.008514 -0.017840 0.000079 0.01335 -0.005287 -0.000648 0.007446

5 rows × 533 columns

11. Model Building

Finally, model building starts here.

The goal of this section is to:

  • Build ML models
  • Evaluate the performance

12. XGBoost Classifier

XGBoost is an optimized distributed gradient boosting model designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way.

In [12]:
%%time
# Define the model
xgb = XGBClassifier(nthread = -1, random_state=0)

# Train the model
xgb.fit(X_train, y_train)

xgb
Wall time: 10min 13s
Out[12]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=-1, nthread=-1, num_parallel_tree=1,
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=1, tree_method='exact', validate_parameters=1,
              verbosity=None)

Let's use the model to get predictions on test dataset. We would be looking at the predicted class and predicted probability both in order to evaluate the performance of the model

In [13]:
# Prediction
y_pred_xgb = xgb.predict(X_test)
y_prob_pred_xgb = xgb.predict_proba(X_test)
y_prob_pred_xgb = [x[1] for x in y_prob_pred_xgb]
print("Y predicted : ",y_pred_xgb)
print("Y probability predicted : ",y_prob_pred_xgb[:5])
Y predicted :  [False False False ... False False False]
Y probability predicted :  [0.00088362244, 0.0128200315, 0.003954415, 0.008103509, 0.0021388768]

13. Evaluation Metrics

  • Accuracy Score
  • Confusion Matrix
  • Classification Report
  • AUC Score
  • Concordance Index
  • ROC curve
  • PR curve

Concordance

In [14]:
from bisect import bisect_left, bisect_right

def concordance(actuals, preds):
    ones_preds  = [p for a,p in zip(actuals, preds) if a == 1]
    zeros_preds = [p for a,p in zip(actuals, preds) if a == 0]
    n_ones      = len([x for x in actuals if x == 1])
    n_total_pairs =  float(n_ones) * float(len(actuals) - n_ones)
    # print("Total Pairs: ", n_total_pairs)

    zeros_sorted = sorted(zeros_preds)

    conc = 0; disc = 0; ties = 0;
    for i, one_pred in enumerate(ones_preds):
        cur_conc = bisect_left(zeros_sorted, one_pred)
        cur_ties = bisect_right(zeros_sorted, one_pred) - cur_conc
        conc += cur_conc
        ties += cur_ties
        disc += float(len(zeros_sorted)) - cur_ties - cur_conc

    concordance = conc/n_total_pairs
    discordance = disc/n_total_pairs
    ties_perc = ties/n_total_pairs
    return concordance

All evaluation metrics

In [15]:
def compute_evaluation_metric(model, x_test, y_actual, y_predicted, y_predicted_prob):
    print("\n Accuracy Score : ",accuracy_score(y_actual,y_predicted))
    print("\n AUC Score : ", roc_auc_score(y_actual, y_predicted_prob))
    print("\n Confusion Matrix : \n",confusion_matrix(y_actual, y_predicted))
    print("\n Classification Report : \n",classification_report(y_actual, y_predicted))
    print("\n Concordance Index : ", concordance(y_actual, y_predicted_prob))

    print("\n ROC curve : \n")
    plot_roc_curve(model, x_test, y_actual)
    plt.show() 

    print("\n PR curve : \n")
    plot_precision_recall_curve(model, x_test, y_actual)
    plt.show() 
In [24]:
concordance(y_test.values, y_prob_pred_xgb)
Out[24]:
0.9348031547662842
In [25]:
# Compute Evaluation Metric
compute_evaluation_metric(xgb, X_test, y_test, y_pred_xgb, y_prob_pred_xgb)
 Accuracy Score :  0.9797247716778993

 AUC Score :  0.9348031681385632

 Confusion Matrix : 
 [[170693    348]
 [  3244   2877]]

 Classification Report : 
               precision    recall  f1-score   support

       False       0.98      1.00      0.99    171041
        True       0.89      0.47      0.62      6121

    accuracy                           0.98    177162
   macro avg       0.94      0.73      0.80    177162
weighted avg       0.98      0.98      0.98    177162


 Concordance Index :  0.9348031547662842

 ROC curve : 

 PR curve : 

14. Capture Rates and Calibration Curve

Divide the data in 10 equal bins as per predicted probability scores. Then, compute the percentage of the total target class 1 captured in every bin.

Ideally the proportion should be decreasing as we go down ever bin. Let's check it out

Create validation set

In [26]:
# Create Validation set
validation_df = {'y_test' : y_test, 'y_pred' : y_pred_xgb, 'y_pred_prob' : y_prob_pred_xgb}
validation_df = pd.DataFrame(data = validation_df)

# Add binning column to the dataframe
validation_df['bin_y_pred_prob'] = pd.qcut(validation_df['y_pred_prob'], q=10)
validation_df.head()
Out[26]:
y_test y_pred y_pred_prob bin_y_pred_prob
7681 False False 0.000884 (-0.0009859, 0.00121]
570242 False False 0.012820 (0.00914, 0.0132]
340470 False False 0.003954 (0.00333, 0.00477]
131781 False False 0.008104 (0.00659, 0.00914]
472772 False False 0.002139 (0.00121, 0.00219]
In [27]:
# Change x label
x_label = []
for i in range(len(validation_df['bin_y_pred_prob'].cat.categories[::-1].astype('str'))):
    x_label.append("Bin" + str(i + 1)+ "(" + validation_df['bin_y_pred_prob'].cat.categories[::-1].astype('str')[i] + ")")

Capture Rates Plot

In [28]:
# Plot Distribution of predicted probabilities for every bin
plt.figure(figsize=(12, 8));
sns.stripplot(validation_df.bin_y_pred_prob, validation_df.y_pred_prob, jitter = 0.15, hue = validation_df.y_test, order = validation_df['bin_y_pred_prob'].cat.categories[::-1])
plt.title("Distribution of predicted probabilities for every bin", fontsize=18)
plt.xlabel("Predicted Probability Bins", fontsize=14);
plt.ylabel("Predicted Probability", fontsize=14);
plt.xticks(np.arange(10), x_label, rotation=45);
plt.show()

Gains Table

In [29]:
# Aggregate the data
gains_df             = validation_df.groupby(["bin_y_pred_prob","y_test"]).agg({'y_test': ['count']})
gains_df.columns     = gains_df.columns.map(''.join)
gains_df['prob_bin'] = gains_df.index.get_level_values(0)
gains_df['y_test']   = gains_df.index.get_level_values(1)
gains_df.reset_index(drop = True, inplace = True)
gains_df

# Get infection rate and percentage infections
gains_table = gains_df.pivot(index='prob_bin', columns='y_test', values='y_testcount')
gains_table['prob_bin'] = gains_table.index
gains_table = gains_table.iloc[::-1]
gains_table['prob_bin'] = x_label
gains_table.reset_index(drop = True, inplace = True)
gains_table = gains_table[['prob_bin', 0, 1]]
gains_table.columns = ['prob_bin', "not_fraud", "fraud"]
gains_table['perc_fraud'] = gains_table['fraud']/gains_table['fraud'].sum()
gains_table['perc_not_fraud'] = gains_table['not_fraud']/gains_table['not_fraud'].sum()
gains_table['cum_perc_fraud'] = 100*(gains_table.fraud.cumsum() / gains_table.fraud.sum()) 
gains_table['cum_perc_not_fraud'] = 100*(gains_table.not_fraud.cumsum() / gains_table.not_fraud.sum()) 
gains_table


# Plot
plt.figure(figsize=(12, 8));
sns.set_style("white")
sns.pointplot(x = "prob_bin", y = "cum_perc_fraud", data = gains_table, legend = False, order=gains_table.prob_bin)
plt.xticks(rotation=45);
plt.ylabel("Fraud Rate", fontsize=14)
plt.xlabel("Prediction probability bin", fontsize=14)
plt.title("Fraud rate for every bin", fontsize=18)
plt.show()

Ideally the slope should be high initially and should decrease as we move further to the right. This is not really a good model.

In [16]:
# One big function.
def captures(y_test, y_pred, y_pred_prob):
    # Create Validation set
    validation_df = {'y_test' : y_test, 'y_pred' : y_pred, 'y_pred_prob' : y_pred_prob}
    validation_df = pd.DataFrame(data = validation_df)

    # Add binning column to the dataframe
    try:
        validation_df['bin_y_pred_prob'] = pd.qcut(validation_df['y_pred_prob'], q=10)
    except:
        validation_df['bin_y_pred_prob'] = pd.qcut(validation_df['y_pred_prob'], q=10, duplicates='drop')
    
    # Change x label and column names
    x_label = []
    for i in range(len(validation_df['bin_y_pred_prob'].cat.categories[::-1].astype('str'))):
        x_label.append("Bin" + str(i + 1)+ "(" + validation_df['bin_y_pred_prob'].cat.categories[::-1].astype('str')[i] + ")")
    
    # Plot Distribution of predicted probabilities for every bin
    plt.figure(figsize=(12, 8));
    sns.stripplot(validation_df.bin_y_pred_prob, validation_df.y_pred_prob, jitter = 0.15, hue = validation_df.y_test, order = validation_df['bin_y_pred_prob'].cat.categories[::-1])
    plt.title("Distribution of predicted probabilities for every bin", fontsize=18)
    plt.xlabel("Predicted Probability Bins", fontsize=14);
    plt.ylabel("Predicted Probability", fontsize=14);
    try:
        plt.xticks(np.arange(10), x_label, rotation=45);
    except:
        pass
    plt.show()
    
    # Aggregate the data
    gains_df             = validation_df.groupby(["bin_y_pred_prob","y_test"]).agg({'y_test': ['count']})
    gains_df.columns     = gains_df.columns.map(''.join)
    gains_df['prob_bin'] = gains_df.index.get_level_values(0)
    gains_df['y_test']   = gains_df.index.get_level_values(1)
    gains_df.reset_index(drop = True, inplace = True)
    gains_df

    # Get infection rate and percentage infections
    gains_table = gains_df.pivot(index='prob_bin', columns='y_test', values='y_testcount')
    gains_table['prob_bin'] = gains_table.index
    gains_table = gains_table.iloc[::-1]
    gains_table['prob_bin'] = x_label
    gains_table.reset_index(drop = True, inplace = True)
    gains_table = gains_table[['prob_bin', 0, 1]]
    gains_table.columns = ['prob_bin', "not_fraud", "fraud"]
    gains_table['perc_fraud'] = gains_table['fraud']/gains_table['fraud'].sum()
    gains_table['perc_not_fraud'] = gains_table['not_fraud']/gains_table['not_fraud'].sum()
    gains_table['cum_perc_fraud'] = 100*(gains_table.fraud.cumsum() / gains_table.fraud.sum()) 
    gains_table['cum_perc_not_fraud'] = 100*(gains_table.not_fraud.cumsum() / gains_table.not_fraud.sum()) 
    gains_table


    # Plot
    plt.figure(figsize=(12, 8));
    sns.set_style("white")
    sns.pointplot(x = "prob_bin", y = "cum_perc_fraud", data = gains_table, legend = False, order=gains_table.prob_bin)
    plt.xticks(rotation=45);
    plt.ylabel("Fraud Rate", fontsize=14)
    plt.xlabel("Prediction probability bin", fontsize=14)
    plt.title("Fraud rate for every bin", fontsize=18)
    plt.show()
    return gains_table
In [31]:
# Gains Table and Capture rates plot
captures(y_test, y_pred_xgb, y_prob_pred_xgb)
Out[31]:
prob_bin not_fraud fraud perc_fraud perc_not_fraud cum_perc_fraud cum_perc_not_fraud
0 Bin1((0.0472, 1.0]) 12835 4882 0.797582 0.075040 79.758209 7.504049
1 Bin2((0.0215, 0.0472]) 17204 512 0.083646 0.100584 88.122856 17.562456
2 Bin3((0.0132, 0.0215]) 17460 256 0.041823 0.102081 92.305179 27.770535
3 Bin4((0.00914, 0.0132]) 17572 144 0.023526 0.102736 94.657736 38.044095
4 Bin5((0.00659, 0.00914]) 17606 109 0.017808 0.102934 96.438490 48.337533
5 Bin6((0.00477, 0.00659]) 17649 68 0.011109 0.103186 97.549420 58.656112
6 Bin7((0.00333, 0.00477]) 17662 54 0.008822 0.103262 98.431629 68.982291
7 Bin8((0.00219, 0.00333]) 17668 48 0.007842 0.103297 99.215814 79.311978
8 Bin9((0.00121, 0.00219]) 17689 27 0.004411 0.103420 99.656919 89.653943
9 Bin10((-0.0009859, 0.00121]) 17696 21 0.003431 0.103461 100.000000 100.000000

Calibration Curve

In [17]:
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
In [18]:
def draw_calibration_curve(y_test, y_prob, n_bins=10):
    plt.figure(figsize=(7, 7), dpi=120)
    ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
    ax2 = plt.subplot2grid((3, 1), (2, 0))
    ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")


    fraction_of_positives, mean_predicted_value = calibration_curve(y_test, y_prob, n_bins=10)

    ax1.plot(mean_predicted_value, fraction_of_positives, "s-", label="%s" % ("Model", ))
    ax2.hist(y_prob, range=(0, 1), bins=10, label="Model", histtype="step", lw=2)

    # Labels
    ax1.set_ylabel("Fraction of positives")
    ax1.set_ylim([-0.05, 1.05])
    ax1.legend(loc="lower right")
    ax1.set_title('Calibration plots  (reliability curve)')

    ax2.set_xlabel("Mean predicted value")
    ax2.set_ylabel("Count")
    ax2.legend(loc="upper center", ncol=2)

    plt.tight_layout()
    plt.show()
  • Chart 1: X axis marks the prediction probability score. Y-axis marks the fraction of the positives.
  • Chart 2: X axis marks the mean predicted value. Y-axis represents the count of records.
In [34]:
draw_calibration_curve(y_test, y_prob_pred_xgb, n_bins=10)

Calibrate the model

Logistic regression

In [35]:
# Prediction
y_pred_xgb_test = xgb.predict(X_test)
y_prob_pred_xgb_test = xgb.predict_proba(X_test)[:, 1]
In [36]:
from sklearn.linear_model import LogisticRegression
X = np.array(y_prob_pred_xgb_test)
clf = LogisticRegression(random_state=0).fit(X.reshape(-1, 1), y_test)
In [37]:
y_prob_pred_calib = clf.predict_proba(X.reshape(-1, 1))[:, 1]
y_pred_calib      = clf.predict(X.reshape(-1, 1))
In [38]:
captures(y_test, y_pred_calib, y_prob_pred_calib)
Out[38]:
prob_bin not_fraud fraud perc_fraud perc_not_fraud cum_perc_fraud cum_perc_not_fraud
0 Bin1((0.0188, 0.998]) 12835 4882 0.797582 0.075040 79.758209 7.504049
1 Bin2((0.0144, 0.0188]) 17204 512 0.083646 0.100584 88.122856 17.562456
2 Bin3((0.0132, 0.0144]) 17460 256 0.041823 0.102081 92.305179 27.770535
3 Bin4((0.0127, 0.0132]) 17572 144 0.023526 0.102736 94.657736 38.044095
4 Bin5((0.0124, 0.0127]) 17606 109 0.017808 0.102934 96.438490 48.337533
5 Bin6((0.0121, 0.0124]) 17649 68 0.011109 0.103186 97.549420 58.656112
6 Bin7((0.0119, 0.0121]) 17662 54 0.008822 0.103262 98.431629 68.982291
7 Bin8((0.0118, 0.0119]) 17668 48 0.007842 0.103297 99.215814 79.311978
8 Bin9((0.0117, 0.0118]) 17689 27 0.004411 0.103420 99.656919 89.653943
9 Bin10((0.010499999999999999, 0.0117]) 17696 21 0.003431 0.103461 100.000000 100.000000
In [39]:
draw_calibration_curve(y_test, y_prob_pred_calib, n_bins=10)

XGBoost with booster = dart

In [40]:
%%time
# Define the model
xgb = XGBClassifier(nthread=-1, random_state=0, booster="dart")

# Train the model
xgb.fit(X_train,y_train)

xgb
Wall time: 11min 20s
Out[40]:
XGBClassifier(base_score=0.5, booster='dart', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=-1, nthread=-1, num_parallel_tree=1,
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=1, tree_method='exact', validate_parameters=1,
              verbosity=None)

Let's use the model to get predictions on test dataset. We would be looking at the predicted class and predicted probability both in order to evaluate the performance of the model

Prediction

In [41]:
# Prediction
y_pred_xgbdart      = xgb.predict(X_test)
y_prob_pred_xgbdart = xgb.predict_proba(X_test)[:, 1]
print("Y predicted : ", y_pred_xgbdart)
print("Y probability predicted : ", y_prob_pred_xgbdart[:5])
Y predicted :  [False False False ... False False False]
Y probability predicted :  [0.00088362 0.01282003 0.00395442 0.00810351 0.00213888]

Evaluation Metrices

Let's compute various evaluation metrices now

  • Accuracy Score
  • Confusion Matrix
  • Classification Report
  • AUC Score
  • Concodense Index
  • ROC curve
  • PR curve
In [42]:
# Compute Evaluation Metric
compute_evaluation_metric(xgb, X_test, y_test, y_pred_xgbdart, y_prob_pred_xgbdart)
 Accuracy Score :  0.9797247716778993

 AUC Score :  0.9348031681385632

 Confusion Matrix : 
 [[170693    348]
 [  3244   2877]]

 Classification Report : 
               precision    recall  f1-score   support

       False       0.98      1.00      0.99    171041
        True       0.89      0.47      0.62      6121

    accuracy                           0.98    177162
   macro avg       0.94      0.73      0.80    177162
weighted avg       0.98      0.98      0.98    177162


 Concordance Index :  0.9348031547662842

 ROC curve : 

 PR curve : 

In [43]:
# Gains Table and Capture rates plot
captures(y_test, y_pred_xgbdart, y_prob_pred_xgbdart)
Out[43]:
prob_bin not_fraud fraud perc_fraud perc_not_fraud cum_perc_fraud cum_perc_not_fraud
0 Bin1((0.0472, 1.0]) 12835 4882 0.797582 0.075040 79.758209 7.504049
1 Bin2((0.0215, 0.0472]) 17204 512 0.083646 0.100584 88.122856 17.562456
2 Bin3((0.0132, 0.0215]) 17460 256 0.041823 0.102081 92.305179 27.770535
3 Bin4((0.00914, 0.0132]) 17572 144 0.023526 0.102736 94.657736 38.044095
4 Bin5((0.00659, 0.00914]) 17606 109 0.017808 0.102934 96.438490 48.337533
5 Bin6((0.00477, 0.00659]) 17649 68 0.011109 0.103186 97.549420 58.656112
6 Bin7((0.00333, 0.00477]) 17662 54 0.008822 0.103262 98.431629 68.982291
7 Bin8((0.00219, 0.00333]) 17668 48 0.007842 0.103297 99.215814 79.311978
8 Bin9((0.00121, 0.00219]) 17689 27 0.004411 0.103420 99.656919 89.653943
9 Bin10((-0.0009859, 0.00121]) 17696 21 0.003431 0.103461 100.000000 100.000000
In [44]:
draw_calibration_curve(y_test, y_prob_pred_xgbdart, n_bins=10)

Inferences:

  • Both of the boosters are giving the similar result, maybe because both are tree based
  • The accuracy score is 0.97, AUC and concordence scores are 0.88.
  • Recall and f1-score are very less for class True. That's because of class imbalance
  • ROC and PR curve also needs improvements

Let's look at LGBM

15. LightGBM

LightGBM is a gradient boosting framework that uses tree based learning algorithms.

It is designed to be distributed and efficient with the following advantages:

  • Faster training speed and higher efficiency.
  • Lower memory usage.
  • Better accuracy.
  • Support of parallel and GPU learning.
  • Capable of handling large-scale data.
In [ ]:
from lightgbm import LGBMClassifier
In [45]:
%%time
# Define the model
lgbc = LGBMClassifier(random_state=0, n_jobs = -1)

# Train the model
lgbc.fit(X_train,y_train)

lgbc
Wall time: 1min 16s
Out[45]:
LGBMClassifier(random_state=0)

Let's use the model to get predictions on test dataset. We would be looking at the predicted class and predicted probability both in order to evaluate the performance of the model

In [46]:
# Prediction
y_pred_lgbc = lgbc.predict(X_test)
y_prob_pred_lgbc = lgbc.predict_proba(X_test)
y_prob_pred_lgbc = [x[1] for x in y_prob_pred_lgbc]
print("Y predicted : ",y_pred_lgbc)
print("Y probability predicted : ",y_prob_pred_lgbc[:5])
Y predicted :  [False False False ... False False False]
Y probability predicted :  [0.0030038692434659836, 0.0204190888379695, 0.017504758484994488, 0.0071321007747928164, 0.0018421571426377808]

Evaluation Metrices

Let's compute various evaluation metrices now

  • Accuracy Score
  • Confusion Matrix
  • Classification Report
  • AUC Score
  • Concodense Index
  • ROC curve
  • PR curve
In [47]:
# Compute Evaluation Metric
compute_evaluation_metric(lgbc, X_test, y_test, y_pred_lgbc, y_prob_pred_lgbc)
 Accuracy Score :  0.9772355245481537

 AUC Score :  0.9276554552960554

 Confusion Matrix : 
 [[170633    408]
 [  3625   2496]]

 Classification Report : 
               precision    recall  f1-score   support

       False       0.98      1.00      0.99    171041
        True       0.86      0.41      0.55      6121

    accuracy                           0.98    177162
   macro avg       0.92      0.70      0.77    177162
weighted avg       0.98      0.98      0.97    177162


 Concordance Index :  0.9276554118361485

 ROC curve : 

 PR curve : 

In [48]:
# Gains Table and Capture rates plot
captures(y_test, y_pred_lgbc, y_prob_pred_lgbc)
Out[48]:
prob_bin not_fraud fraud perc_fraud perc_not_fraud cum_perc_fraud cum_perc_not_fraud
0 Bin1((0.0529, 0.996]) 12911 4806 0.785166 0.075485 78.516582 7.548483
1 Bin2((0.0253, 0.0529]) 17203 513 0.083810 0.100578 86.897566 17.606305
2 Bin3((0.0157, 0.0253]) 17447 269 0.043947 0.102005 91.292273 27.806783
3 Bin4((0.0111, 0.0157]) 17549 167 0.027283 0.102601 94.020585 38.066896
4 Bin5((0.00852, 0.0111]) 17589 127 0.020748 0.102835 96.095409 48.350396
5 Bin6((0.00665, 0.00852]) 17629 87 0.014213 0.103069 97.516746 58.657281
6 Bin7((0.00507, 0.00665]) 17669 47 0.007678 0.103303 98.284594 68.987553
7 Bin8((0.00374, 0.00507]) 17673 43 0.007025 0.103326 98.987094 79.320163
8 Bin9((0.00266, 0.00374]) 17682 34 0.005555 0.103379 99.542558 89.658035
9 Bin10((-0.000701, 0.00266]) 17689 28 0.004574 0.103420 100.000000 100.000000
In [49]:
draw_calibration_curve(y_test, y_prob_pred_lgbc, n_bins=10)

Inferences:

  • With LGBM, Accuracy score is 97.7%. It's almost similar to XGBoost model
  • AUC score has imporved to 92.6 from 88.5
  • Recall and f-1 score have also improved, but it's still not upto the mark

16. Random Forest Classifier

In [50]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
In [51]:
X_train.head()
Out[51]:
TransactionAmt ProductCD card1 card2 card3 card4 card5 card6 addr1 addr2 dist1 P_emaildomain R_emaildomain C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 D1 D2 D3 D4 D5 D6 D8 D9 D10 D11 D12 D13 D14 D15 M1 M2 M3 M4 M5 M6 M7 M8 M9 id_01 id_02 id_03 id_04 id_05 id_06 id_09 id_10 id_11 id_12 id_13 id_14 id_15 id_16 id_17 id_19 id_20 id_28 id_29 id_30 id_31 id_32 id_33 id_34 id_35 id_36 id_37 id_38 DeviceType DeviceInfo card2_missing_flag card3_missing_flag card4_missing_flag card5_missing_flag card6_missing_flag addr1_missing_flag addr2_missing_flag dist1_missing_flag dist2_missing_flag P_emaildomain_missing_flag R_emaildomain_missing_flag D1_missing_flag D2_missing_flag D3_missing_flag D4_missing_flag D5_missing_flag D6_missing_flag D7_missing_flag D8_missing_flag D9_missing_flag D10_missing_flag D11_missing_flag D12_missing_flag D13_missing_flag D14_missing_flag D15_missing_flag M1_missing_flag M2_missing_flag M3_missing_flag M4_missing_flag M5_missing_flag M6_missing_flag M7_missing_flag M8_missing_flag M9_missing_flag V1_missing_flag V2_missing_flag V3_missing_flag V4_missing_flag V5_missing_flag V6_missing_flag V7_missing_flag V8_missing_flag V9_missing_flag V10_missing_flag V11_missing_flag V12_missing_flag V13_missing_flag V14_missing_flag V15_missing_flag V16_missing_flag V17_missing_flag V18_missing_flag V19_missing_flag V20_missing_flag V21_missing_flag V22_missing_flag V23_missing_flag V24_missing_flag V25_missing_flag V26_missing_flag V27_missing_flag V28_missing_flag V29_missing_flag V30_missing_flag V31_missing_flag V32_missing_flag V33_missing_flag V34_missing_flag V35_missing_flag V36_missing_flag V37_missing_flag V38_missing_flag V39_missing_flag V40_missing_flag V41_missing_flag V42_missing_flag V43_missing_flag V44_missing_flag V45_missing_flag V46_missing_flag V47_missing_flag V48_missing_flag V49_missing_flag V50_missing_flag V51_missing_flag V52_missing_flag V53_missing_flag V54_missing_flag V55_missing_flag V56_missing_flag V57_missing_flag V58_missing_flag V59_missing_flag V60_missing_flag V61_missing_flag V62_missing_flag V63_missing_flag V64_missing_flag V65_missing_flag V66_missing_flag V67_missing_flag V68_missing_flag V69_missing_flag V70_missing_flag V71_missing_flag V72_missing_flag V73_missing_flag V74_missing_flag V75_missing_flag V76_missing_flag V77_missing_flag V78_missing_flag V79_missing_flag V80_missing_flag V81_missing_flag V82_missing_flag V83_missing_flag V84_missing_flag V85_missing_flag V86_missing_flag V87_missing_flag V88_missing_flag V89_missing_flag V90_missing_flag V91_missing_flag V92_missing_flag V93_missing_flag V94_missing_flag V95_missing_flag V96_missing_flag V97_missing_flag V98_missing_flag V99_missing_flag V100_missing_flag V101_missing_flag V102_missing_flag V103_missing_flag V104_missing_flag V105_missing_flag V106_missing_flag V107_missing_flag V108_missing_flag V109_missing_flag V110_missing_flag V111_missing_flag V112_missing_flag V113_missing_flag V114_missing_flag V115_missing_flag V116_missing_flag V117_missing_flag V118_missing_flag V119_missing_flag V120_missing_flag V121_missing_flag V122_missing_flag V123_missing_flag V124_missing_flag V125_missing_flag V126_missing_flag V127_missing_flag V128_missing_flag V129_missing_flag V130_missing_flag V131_missing_flag V132_missing_flag V133_missing_flag V134_missing_flag V135_missing_flag ... V169_missing_flag V170_missing_flag V171_missing_flag V172_missing_flag V173_missing_flag V174_missing_flag V175_missing_flag V176_missing_flag V177_missing_flag V178_missing_flag V179_missing_flag V180_missing_flag V181_missing_flag V182_missing_flag V183_missing_flag V184_missing_flag V185_missing_flag V186_missing_flag V187_missing_flag V188_missing_flag V189_missing_flag V190_missing_flag V191_missing_flag V192_missing_flag V193_missing_flag V194_missing_flag V195_missing_flag V196_missing_flag V197_missing_flag V198_missing_flag V199_missing_flag V200_missing_flag V201_missing_flag V202_missing_flag V203_missing_flag V204_missing_flag V205_missing_flag V206_missing_flag V207_missing_flag V208_missing_flag V209_missing_flag V210_missing_flag V211_missing_flag V212_missing_flag V213_missing_flag V214_missing_flag V215_missing_flag V216_missing_flag V217_missing_flag V218_missing_flag V219_missing_flag V220_missing_flag V221_missing_flag V222_missing_flag V223_missing_flag V224_missing_flag V225_missing_flag V226_missing_flag V227_missing_flag V228_missing_flag V229_missing_flag V230_missing_flag V231_missing_flag V232_missing_flag V233_missing_flag V234_missing_flag V235_missing_flag V236_missing_flag V237_missing_flag V238_missing_flag V239_missing_flag V240_missing_flag V241_missing_flag V242_missing_flag V243_missing_flag V244_missing_flag V245_missing_flag V246_missing_flag V247_missing_flag V248_missing_flag V249_missing_flag V250_missing_flag V251_missing_flag V252_missing_flag V253_missing_flag V254_missing_flag V255_missing_flag V256_missing_flag V257_missing_flag V258_missing_flag V259_missing_flag V260_missing_flag V261_missing_flag V262_missing_flag V263_missing_flag V264_missing_flag V265_missing_flag V266_missing_flag V267_missing_flag V268_missing_flag V269_missing_flag V270_missing_flag V271_missing_flag V272_missing_flag V273_missing_flag V274_missing_flag V275_missing_flag V276_missing_flag V277_missing_flag V278_missing_flag V279_missing_flag V280_missing_flag V281_missing_flag V282_missing_flag V283_missing_flag V284_missing_flag V285_missing_flag V286_missing_flag V287_missing_flag V288_missing_flag V289_missing_flag V290_missing_flag V291_missing_flag V292_missing_flag V293_missing_flag V294_missing_flag V295_missing_flag V296_missing_flag V297_missing_flag V298_missing_flag V299_missing_flag V300_missing_flag V301_missing_flag V302_missing_flag V303_missing_flag V304_missing_flag V305_missing_flag V306_missing_flag V307_missing_flag V308_missing_flag V309_missing_flag V310_missing_flag V311_missing_flag V312_missing_flag V313_missing_flag V314_missing_flag V315_missing_flag V316_missing_flag V317_missing_flag V318_missing_flag V319_missing_flag V320_missing_flag V321_missing_flag V322_missing_flag V323_missing_flag V324_missing_flag V325_missing_flag V326_missing_flag V327_missing_flag V328_missing_flag V329_missing_flag V330_missing_flag V331_missing_flag V332_missing_flag V333_missing_flag V334_missing_flag V335_missing_flag V336_missing_flag V337_missing_flag V338_missing_flag V339_missing_flag id_01_missing_flag id_02_missing_flag id_03_missing_flag id_04_missing_flag id_05_missing_flag id_06_missing_flag id_07_missing_flag id_08_missing_flag id_09_missing_flag id_10_missing_flag id_11_missing_flag id_12_missing_flag id_13_missing_flag id_14_missing_flag id_15_missing_flag id_16_missing_flag id_17_missing_flag id_18_missing_flag id_19_missing_flag id_20_missing_flag id_21_missing_flag id_22_missing_flag id_23_missing_flag id_24_missing_flag id_25_missing_flag id_26_missing_flag id_27_missing_flag id_28_missing_flag id_29_missing_flag id_30_missing_flag id_31_missing_flag id_32_missing_flag id_33_missing_flag id_34_missing_flag id_35_missing_flag id_36_missing_flag id_37_missing_flag id_38_missing_flag DeviceType_missing_flag DeviceInfo_missing_flag _Weekdays _Hours _Days Trans_min_mean Trans_min_std TransactionAmt_to_mean_card1 TransactionAmt_to_mean_card4 TransactionAmt_to_std_card1 TransactionAmt_to_std_card4 PCA_V_0 PCA_V_1 PCA_V_2 PCA_V_3 PCA_V_4 PCA_V_5 PCA_V_6 PCA_V_7 PCA_V_8 PCA_V_9 PCA_V_10 PCA_V_11 PCA_V_12 PCA_V_13 PCA_V_14 PCA_V_15 PCA_V_16 PCA_V_17 PCA_V_18 PCA_V_19 PCA_V_20 PCA_V_21 PCA_V_22 PCA_V_23 PCA_V_24 PCA_V_25 PCA_V_26 PCA_V_27 PCA_V_28 PCA_V_29
464452 3.432 4 15063 514.0 150.0 4 226.0 1 315.0 87.0 18.0 0 2 1.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 2.0 1.0 35.0 35.0 34.0 34.0 34.0 NaN NaN NaN 34.0 34.0 NaN NaN NaN 34.0 1 1 1 3 2 1 0 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN 3 2 NaN NaN NaN 2 2 3 4 NaN NaN 4 2 2 2 2 2 NaN 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 4 18 -104.06 -0.4350 0.1501 0.2324 0.092611 0.13560 -0.9330 0.3564 0.2969 -0.05145 0.012596 0.01012 0.02788 0.1576 -0.053300 0.13940 0.033940 0.021680 -0.04034 0.02574 -0.022110 -0.003386 0.019290 0.013780 -0.007637 -0.031980 0.012856 -0.007412 0.000978 -0.016700 -0.007600 -0.027450 0.00541 -0.006160 0.005920 0.008170
36372 3.717 0 6019 583.0 150.0 4 226.0 1 NaN NaN NaN 0 0 928.0 1301.0 0.0 475.0 0.0 475.0 476.0 688.0 0.0 823.0 667.0 667.0 695.0 300.0 0.0 NaN NaN 0.0 NaN 0.0 NaN NaN NaN NaN 0.0 0.0 302.0 0.0 2 2 2 2 2 2 2 2 2 -5.0 49333.0 0.0 0.0 0.0 -1.0 0.0 0.0 100.0 1 52.0 -360.0 0 0 166.0 410.0 611.0 0 0 4 0 24.0 0.02858 3 1 0 1 0 0 0.0808 0 0 0 0 0 1 1 1 0 0 0 0 1 1 0 1 0 1 1 1 1 1 0 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 11 -93.94 -0.3928 0.1823 0.3086 0.103778 0.18000 0.6445 -0.9595 0.6094 0.50200 -0.377700 -0.15760 0.07880 0.1742 1.205000 0.12660 -0.007683 -0.130900 -0.04294 0.00345 0.057530 0.048770 -0.015470 -0.016300 0.073850 0.016130 -0.018260 0.036070 -0.019200 -0.050660 -0.045530 -0.032350 -0.02182 0.038240 -0.009430 0.004864
572387 5.527 4 17188 321.0 150.0 4 226.0 2 299.0 87.0 1.0 6 2 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0.0 1.0 0.0 14.0 1.0 568.0 80.0 1.0 569.0 1.0 NaN NaN NaN 569.0 569.0 NaN NaN NaN 569.0 1 1 1 1 0 1 0 1 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN 3 2 NaN NaN NaN 2 2 3 4 NaN NaN 4 2 2 2 2 2 NaN 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 21 25 116.00 0.4849 1.9370 1.8850 1.348902 1.10000 -0.8013 0.3174 0.2720 -0.02687 0.049070 -0.00634 -0.02972 -0.2195 0.019070 0.03270 -0.009254 -0.009834 0.02269 -0.02753 -0.009140 -0.002127 -0.001934 -0.025130 0.016200 0.040620 0.000128 0.001646 -0.004090 0.010020 0.002650 0.010440 0.00992 0.005665 -0.008720 -0.011665
497276 3.950 4 1214 174.0 150.0 4 226.0 1 204.0 87.0 NaN 4 2 1.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN 0.0 1 1 1 3 2 0 0 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN 3 2 NaN NaN NaN 2 2 3 4 NaN NaN 4 2 2 2 2 2 NaN 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 19 30 -83.06 -0.3474 0.2540 0.3901 0.147945 0.22770 -0.9326 0.3564 0.2974 -0.05154 0.010610 0.00979 0.02590 0.1611 -0.054500 0.13270 0.035740 -0.010574 -0.00642 -0.03049 -0.012240 0.017960 0.018100 0.002857 -0.016740 0.007100 0.018840 -0.006878 0.008800 -0.004406 0.000245 -0.014570 -0.00884 -0.011110 -0.000597 -0.007458
97470 3.219 1 12695 490.0 150.0 4 226.0 2 325.0 87.0 NaN 6 6 1.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 2 2 3 2 2 2 2 2 -5.0 68281.0 NaN NaN 0.0 0.0 NaN NaN 100.0 1 52.0 NaN 1 1 225.0 266.0 507.0 1 1 3 0 NaN NaN 4 0 0 1 0 0 NaN 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 0 0 1 0 0 1 1 1 1 1 1 1 0 0 1 0 1 1 1 0 0 0 0 0 1 5 21 23 -110.00 -0.4600 0.1771 0.1877 0.116133 0.10956 2.8070 0.3457 0.3772 -0.01767 -0.042420 -0.24430 0.02310 0.0118 0.000914 0.02304 -0.009550 -0.022300 -0.02167 -0.01366 -0.005627 -0.009690 0.001454 -0.008240 -0.008900 0.005936 -0.004475 -0.000716 -0.003060 0.008514 -0.017840 0.000079 0.01335 -0.005287 -0.000648 0.007446

5 rows × 533 columns

Impute Missing values. Since sklearn algos are not designed to handle missing values.

In [52]:
from sklearn.impute import KNNImputer, SimpleImputer

# replace inf
X_train = X_train.replace([np.inf, -np.inf], np.nan)
X_test = X_test.replace([np.inf, -np.inf], np.nan)

# Impute
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# imputer = KNNImputer(n_neighbors=3)

X_train_imputed = imputer.fit_transform(X_train)
X_train_imputed = pd.DataFrame(X_train_imputed, columns=X_train.columns)
X_train_imputed.head()
Out[52]:
TransactionAmt ProductCD card1 card2 card3 card4 card5 card6 addr1 addr2 dist1 P_emaildomain R_emaildomain C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 D1 D2 D3 D4 D5 D6 D8 D9 D10 D11 D12 D13 D14 D15 M1 M2 M3 M4 M5 M6 M7 M8 M9 id_01 id_02 id_03 id_04 id_05 id_06 id_09 id_10 id_11 id_12 id_13 id_14 id_15 id_16 id_17 id_19 id_20 id_28 id_29 id_30 id_31 id_32 id_33 id_34 id_35 id_36 id_37 id_38 DeviceType DeviceInfo card2_missing_flag card3_missing_flag card4_missing_flag card5_missing_flag card6_missing_flag addr1_missing_flag addr2_missing_flag dist1_missing_flag dist2_missing_flag P_emaildomain_missing_flag R_emaildomain_missing_flag D1_missing_flag D2_missing_flag D3_missing_flag D4_missing_flag D5_missing_flag D6_missing_flag D7_missing_flag D8_missing_flag D9_missing_flag D10_missing_flag D11_missing_flag D12_missing_flag D13_missing_flag D14_missing_flag D15_missing_flag M1_missing_flag M2_missing_flag M3_missing_flag M4_missing_flag M5_missing_flag M6_missing_flag M7_missing_flag M8_missing_flag M9_missing_flag V1_missing_flag V2_missing_flag V3_missing_flag V4_missing_flag V5_missing_flag V6_missing_flag V7_missing_flag V8_missing_flag V9_missing_flag V10_missing_flag V11_missing_flag V12_missing_flag V13_missing_flag V14_missing_flag V15_missing_flag V16_missing_flag V17_missing_flag V18_missing_flag V19_missing_flag V20_missing_flag V21_missing_flag V22_missing_flag V23_missing_flag V24_missing_flag V25_missing_flag V26_missing_flag V27_missing_flag V28_missing_flag V29_missing_flag V30_missing_flag V31_missing_flag V32_missing_flag V33_missing_flag V34_missing_flag V35_missing_flag V36_missing_flag V37_missing_flag V38_missing_flag V39_missing_flag V40_missing_flag V41_missing_flag V42_missing_flag V43_missing_flag V44_missing_flag V45_missing_flag V46_missing_flag V47_missing_flag V48_missing_flag V49_missing_flag V50_missing_flag V51_missing_flag V52_missing_flag V53_missing_flag V54_missing_flag V55_missing_flag V56_missing_flag V57_missing_flag V58_missing_flag V59_missing_flag V60_missing_flag V61_missing_flag V62_missing_flag V63_missing_flag V64_missing_flag V65_missing_flag V66_missing_flag V67_missing_flag V68_missing_flag V69_missing_flag V70_missing_flag V71_missing_flag V72_missing_flag V73_missing_flag V74_missing_flag V75_missing_flag V76_missing_flag V77_missing_flag V78_missing_flag V79_missing_flag V80_missing_flag V81_missing_flag V82_missing_flag V83_missing_flag V84_missing_flag V85_missing_flag V86_missing_flag V87_missing_flag V88_missing_flag V89_missing_flag V90_missing_flag V91_missing_flag V92_missing_flag V93_missing_flag V94_missing_flag V95_missing_flag V96_missing_flag V97_missing_flag V98_missing_flag V99_missing_flag V100_missing_flag V101_missing_flag V102_missing_flag V103_missing_flag V104_missing_flag V105_missing_flag V106_missing_flag V107_missing_flag V108_missing_flag V109_missing_flag V110_missing_flag V111_missing_flag V112_missing_flag V113_missing_flag V114_missing_flag V115_missing_flag V116_missing_flag V117_missing_flag V118_missing_flag V119_missing_flag V120_missing_flag V121_missing_flag V122_missing_flag V123_missing_flag V124_missing_flag V125_missing_flag V126_missing_flag V127_missing_flag V128_missing_flag V129_missing_flag V130_missing_flag V131_missing_flag V132_missing_flag V133_missing_flag V134_missing_flag V135_missing_flag ... V169_missing_flag V170_missing_flag V171_missing_flag V172_missing_flag V173_missing_flag V174_missing_flag V175_missing_flag V176_missing_flag V177_missing_flag V178_missing_flag V179_missing_flag V180_missing_flag V181_missing_flag V182_missing_flag V183_missing_flag V184_missing_flag V185_missing_flag V186_missing_flag V187_missing_flag V188_missing_flag V189_missing_flag V190_missing_flag V191_missing_flag V192_missing_flag V193_missing_flag V194_missing_flag V195_missing_flag V196_missing_flag V197_missing_flag V198_missing_flag V199_missing_flag V200_missing_flag V201_missing_flag V202_missing_flag V203_missing_flag V204_missing_flag V205_missing_flag V206_missing_flag V207_missing_flag V208_missing_flag V209_missing_flag V210_missing_flag V211_missing_flag V212_missing_flag V213_missing_flag V214_missing_flag V215_missing_flag V216_missing_flag V217_missing_flag V218_missing_flag V219_missing_flag V220_missing_flag V221_missing_flag V222_missing_flag V223_missing_flag V224_missing_flag V225_missing_flag V226_missing_flag V227_missing_flag V228_missing_flag V229_missing_flag V230_missing_flag V231_missing_flag V232_missing_flag V233_missing_flag V234_missing_flag V235_missing_flag V236_missing_flag V237_missing_flag V238_missing_flag V239_missing_flag V240_missing_flag V241_missing_flag V242_missing_flag V243_missing_flag V244_missing_flag V245_missing_flag V246_missing_flag V247_missing_flag V248_missing_flag V249_missing_flag V250_missing_flag V251_missing_flag V252_missing_flag V253_missing_flag V254_missing_flag V255_missing_flag V256_missing_flag V257_missing_flag V258_missing_flag V259_missing_flag V260_missing_flag V261_missing_flag V262_missing_flag V263_missing_flag V264_missing_flag V265_missing_flag V266_missing_flag V267_missing_flag V268_missing_flag V269_missing_flag V270_missing_flag V271_missing_flag V272_missing_flag V273_missing_flag V274_missing_flag V275_missing_flag V276_missing_flag V277_missing_flag V278_missing_flag V279_missing_flag V280_missing_flag V281_missing_flag V282_missing_flag V283_missing_flag V284_missing_flag V285_missing_flag V286_missing_flag V287_missing_flag V288_missing_flag V289_missing_flag V290_missing_flag V291_missing_flag V292_missing_flag V293_missing_flag V294_missing_flag V295_missing_flag V296_missing_flag V297_missing_flag V298_missing_flag V299_missing_flag V300_missing_flag V301_missing_flag V302_missing_flag V303_missing_flag V304_missing_flag V305_missing_flag V306_missing_flag V307_missing_flag V308_missing_flag V309_missing_flag V310_missing_flag V311_missing_flag V312_missing_flag V313_missing_flag V314_missing_flag V315_missing_flag V316_missing_flag V317_missing_flag V318_missing_flag V319_missing_flag V320_missing_flag V321_missing_flag V322_missing_flag V323_missing_flag V324_missing_flag V325_missing_flag V326_missing_flag V327_missing_flag V328_missing_flag V329_missing_flag V330_missing_flag V331_missing_flag V332_missing_flag V333_missing_flag V334_missing_flag V335_missing_flag V336_missing_flag V337_missing_flag V338_missing_flag V339_missing_flag id_01_missing_flag id_02_missing_flag id_03_missing_flag id_04_missing_flag id_05_missing_flag id_06_missing_flag id_07_missing_flag id_08_missing_flag id_09_missing_flag id_10_missing_flag id_11_missing_flag id_12_missing_flag id_13_missing_flag id_14_missing_flag id_15_missing_flag id_16_missing_flag id_17_missing_flag id_18_missing_flag id_19_missing_flag id_20_missing_flag id_21_missing_flag id_22_missing_flag id_23_missing_flag id_24_missing_flag id_25_missing_flag id_26_missing_flag id_27_missing_flag id_28_missing_flag id_29_missing_flag id_30_missing_flag id_31_missing_flag id_32_missing_flag id_33_missing_flag id_34_missing_flag id_35_missing_flag id_36_missing_flag id_37_missing_flag id_38_missing_flag DeviceType_missing_flag DeviceInfo_missing_flag _Weekdays _Hours _Days Trans_min_mean Trans_min_std TransactionAmt_to_mean_card1 TransactionAmt_to_mean_card4 TransactionAmt_to_std_card1 TransactionAmt_to_std_card4 PCA_V_0 PCA_V_1 PCA_V_2 PCA_V_3 PCA_V_4 PCA_V_5 PCA_V_6 PCA_V_7 PCA_V_8 PCA_V_9 PCA_V_10 PCA_V_11 PCA_V_12 PCA_V_13 PCA_V_14 PCA_V_15 PCA_V_16 PCA_V_17 PCA_V_18 PCA_V_19 PCA_V_20 PCA_V_21 PCA_V_22 PCA_V_23 PCA_V_24 PCA_V_25 PCA_V_26 PCA_V_27 PCA_V_28 PCA_V_29
0 3.432 4.0 15063.0 514.0 150.0 4.0 226.0 1.0 315.00000 87.000000 18.00000 0.0 2.0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 2.0 1.0 35.0 35.000000 34.000000 34.000000 34.000000 69.466886 145.896923 0.562258 34.000000 34.00000 53.870766 17.999492 57.910973 34.000000 1.0 1.0 1.0 3.0 2.0 1.0 0.0 0.0 1.0 -10.167535 175100.840652 0.063532 -0.055814 1.628387 -6.764911 0.095223 -0.30179 99.745253 2.0 48.029567 -344.788868 3.0 2.0 189.486242 352.986677 403.654199 2.0 2.0 3.0 4.0 26.519707 0.011556 4.0 2.0 2.0 2.0 2.0 2.0 0.041219 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 4.0 18.0 -104.06 -0.4350 0.1501 0.2324 0.092611 0.13560 -0.9330 0.3564 0.2969 -0.05145 0.012596 0.01012 0.02788 0.1576 -0.053300 0.13940 0.033940 0.021680 -0.04034 0.02574 -0.022110 -0.003386 0.019290 0.013780 -0.007637 -0.031980 0.012856 -0.007412 0.000978 -0.016700 -0.007600 -0.027450 0.00541 -0.006160 0.005920 0.008170
1 3.717 0.0 6019.0 583.0 150.0 4.0 226.0 1.0 290.73683 86.796065 119.26547 0.0 0.0 928.0 1301.0 0.0 475.0 0.0 475.0 476.0 688.0 0.0 823.0 667.0 667.0 695.0 300.0 0.0 169.661838 28.353969 0.000000 42.323807 0.000000 145.896923 0.562258 123.914942 146.43237 0.000000 0.000000 302.000000 0.000000 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 -5.000000 49333.000000 0.000000 0.000000 0.000000 -1.000000 0.000000 0.00000 100.000000 1.0 52.000000 -360.000000 0.0 0.0 166.000000 410.000000 611.000000 0.0 0.0 4.0 0.0 24.000000 0.028580 3.0 1.0 0.0 1.0 0.0 0.0 0.080800 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 11.0 -93.94 -0.3928 0.1823 0.3086 0.103778 0.18000 0.6445 -0.9595 0.6094 0.50200 -0.377700 -0.15760 0.07880 0.1742 1.205000 0.12660 -0.007683 -0.130900 -0.04294 0.00345 0.057530 0.048770 -0.015470 -0.016300 0.073850 0.016130 -0.018260 0.036070 -0.019200 -0.050660 -0.045530 -0.032350 -0.02182 0.038240 -0.009430 0.004864
2 5.527 4.0 17188.0 321.0 150.0 4.0 226.0 2.0 299.00000 87.000000 1.00000 6.0 2.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0.0 1.0 0.0 14.0 1.0 568.0 80.000000 1.000000 569.000000 1.000000 69.466886 145.896923 0.562258 569.000000 569.00000 53.870766 17.999492 57.910973 569.000000 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 -10.167535 175100.840652 0.063532 -0.055814 1.628387 -6.764911 0.095223 -0.30179 99.745253 2.0 48.029567 -344.788868 3.0 2.0 189.486242 352.986677 403.654199 2.0 2.0 3.0 4.0 26.519707 0.011556 4.0 2.0 2.0 2.0 2.0 2.0 0.041219 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 4.0 21.0 25.0 116.00 0.4849 1.9370 1.8850 1.348902 1.10000 -0.8013 0.3174 0.2720 -0.02687 0.049070 -0.00634 -0.02972 -0.2195 0.019070 0.03270 -0.009254 -0.009834 0.02269 -0.02753 -0.009140 -0.002127 -0.001934 -0.025130 0.016200 0.040620 0.000128 0.001646 -0.004090 0.010020 0.002650 0.010440 0.00992 0.005665 -0.008720 -0.011665
3 3.950 4.0 1214.0 174.0 150.0 4.0 226.0 1.0 204.00000 87.000000 119.26547 4.0 2.0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 169.661838 28.353969 0.000000 42.323807 69.466886 145.896923 0.562258 0.000000 0.00000 53.870766 17.999492 57.910973 0.000000 1.0 1.0 1.0 3.0 2.0 0.0 0.0 0.0 1.0 -10.167535 175100.840652 0.063532 -0.055814 1.628387 -6.764911 0.095223 -0.30179 99.745253 2.0 48.029567 -344.788868 3.0 2.0 189.486242 352.986677 403.654199 2.0 2.0 3.0 4.0 26.519707 0.011556 4.0 2.0 2.0 2.0 2.0 2.0 0.041219 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 19.0 30.0 -83.06 -0.3474 0.2540 0.3901 0.147945 0.22770 -0.9326 0.3564 0.2974 -0.05154 0.010610 0.00979 0.02590 0.1611 -0.054500 0.13270 0.035740 -0.010574 -0.00642 -0.03049 -0.012240 0.017960 0.018100 0.002857 -0.016740 0.007100 0.018840 -0.006878 0.008800 -0.004406 0.000245 -0.014570 -0.00884 -0.011110 -0.000597 -0.007458
4 3.219 1.0 12695.0 490.0 150.0 4.0 226.0 2.0 325.00000 87.000000 119.26547 6.0 6.0 1.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 169.661838 28.353969 139.771485 42.323807 69.466886 145.896923 0.562258 123.914942 146.43237 53.870766 17.999492 57.910973 163.670779 2.0 2.0 2.0 3.0 2.0 2.0 2.0 2.0 2.0 -5.000000 68281.000000 0.063532 -0.055814 0.000000 0.000000 0.095223 -0.30179 100.000000 1.0 52.000000 -344.788868 1.0 1.0 225.000000 266.000000 507.000000 1.0 1.0 3.0 0.0 26.519707 0.011556 4.0 0.0 0.0 1.0 0.0 0.0 0.041219 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 5.0 21.0 23.0 -110.00 -0.4600 0.1771 0.1877 0.116133 0.10956 2.8070 0.3457 0.3772 -0.01767 -0.042420 -0.24430 0.02310 0.0118 0.000914 0.02304 -0.009550 -0.022300 -0.02167 -0.01366 -0.005627 -0.009690 0.001454 -0.008240 -0.008900 0.005936 -0.004475 -0.000716 -0.003060 0.008514 -0.017840 0.000079 0.01335 -0.005287 -0.000648 0.007446

5 rows × 533 columns

Build and train the Classifier

In [101]:
%%time
# Define the model
rfc = RandomForestClassifier(random_state=0, n_jobs = -1)
# rfc = ExtraTreesClassifier(random_state=0, n_jobs = -1)
# rfc = AdaBoostClassifier(random_state=0)
# rfc = GradientBoostingClassifier(random_state=0)

# Train the model
rfc.fit(X_train_imputed, y_train)

rfc

Predicting on test data

In [54]:
# Impute X_Test before predicting
X_test_imputed = imputer.transform(X_test)

# Prediction
y_pred_rfc = rfc.predict(X_test_imputed)
y_prob_pred_rfc = rfc.predict_proba(X_test_imputed)[:, 1]

print("Y predicted : ",y_pred_rfc)
print("Y probability predicted : ",y_prob_pred_rfc[:5])
Y predicted :  [False False False ... False False False]
Y probability predicted :  [0.01302948 0.03184159 0.0321819  0.01066897 0.00670777]

Evaluation metrics

In [55]:
# Compute Evaluation Metric
compute_evaluation_metric(rfc, X_test_imputed, y_test, y_pred_rfc, y_prob_pred_rfc)
 Accuracy Score :  0.9736173671554849

 AUC Score :  0.8775531354407143

 Confusion Matrix : 
 [[170697    344]
 [  4330   1791]]

 Classification Report : 
               precision    recall  f1-score   support

       False       0.98      1.00      0.99    171041
        True       0.84      0.29      0.43      6121

    accuracy                           0.97    177162
   macro avg       0.91      0.65      0.71    177162
weighted avg       0.97      0.97      0.97    177162


 Concordance Index :  0.8775057827680287

 ROC curve : 

 PR curve : 

In [56]:
# Concordance
concordance(y_test.values, y_prob_pred_rfc)
Out[56]:
0.8775057827680287
In [57]:
# Gains Table and Capture rates plot
captures(y_test, y_pred_rfc, y_prob_pred_rfc)
Out[57]:
prob_bin not_fraud fraud perc_fraud perc_not_fraud cum_perc_fraud cum_perc_not_fraud
0 Bin1((0.0532, 0.999]) 13669 4048 0.661330 0.079917 66.132985 7.991651
1 Bin2((0.0311, 0.0532]) 17001 715 0.116811 0.099397 77.814083 17.931373
2 Bin3((0.0213, 0.0311]) 17318 398 0.065022 0.101251 84.316288 28.056431
3 Bin4((0.016, 0.0213]) 17441 275 0.044927 0.101970 88.809018 38.253401
4 Bin5((0.013, 0.016]) 17504 182 0.029734 0.102338 91.782388 48.487205
5 Bin6((0.0114, 0.013]) 17574 172 0.028100 0.102747 94.592387 58.761934
6 Bin7((0.0097, 0.0114]) 17552 109 0.017808 0.102619 96.373142 69.023801
7 Bin8((0.00801, 0.0097]) 17682 88 0.014377 0.103379 97.810815 79.361674
8 Bin9((0.0068, 0.00801]) 17636 81 0.013233 0.103110 99.134128 89.672652
9 Bin10((0.00179, 0.0068]) 17664 53 0.008659 0.103273 100.000000 100.000000
In [58]:
draw_calibration_curve(y_test, y_prob_pred_rfc, n_bins=10)

17. Handling Class Imbalance

Handle Class Imbalance with Random Oversampler

Imbalanced classes are a common problem in machine learning classification where there are a disproportionate ratio of observations in each class.

Most machine learning algorithms work best when the number of samples in each class are about equal. This is because most algorithms are designed to maximize accuracy and reduce error.

  • Upsample
In [59]:
# random over sampler
ros = RandomOverSampler()
X_train_ros, y_train_ros = ros.fit_sample(X_train_imputed, y_train)
y_train_ros.value_counts()
Out[59]:
True     398836
False    398836
Name: isFraud, dtype: int64
In [61]:
%%time
# Define the model
lgbc_ros = LGBMClassifier(random_state=0)

# Train the model
lgbc_ros.fit(X_train_ros,y_train_ros)

lgbc_ros
Wall time: 3min 20s
Out[61]:
LGBMClassifier(random_state=0)

Let's use the model to get predictions on test dataset. We would be looking at the predicted class and predicted probability both in order to evaluate the performance of the model

In [62]:
# Prediction on the original test dataset
y_pred_lgbcros = lgbc_ros.predict(X_test_imputed)
y_prob_pred_lgbcros = lgbc_ros.predict_proba(X_test_imputed)[:, 1]

print("Y predicted : ",y_pred_lgbcros)
print("Y probability predicted : ",y_prob_pred_lgbcros[:5])
Y predicted :  [False False False ... False False  True]
Y probability predicted :  [0.04800962 0.30784135 0.30226978 0.104903   0.03731646]

Evaluation Metrics

Let's compute various evaluation metrices now

  • Accuracy Score
  • Confusion Matrix
  • Classification Report
  • AUC Score
  • Concodense Index
  • ROC curve
  • PR curve
In [63]:
# Compute Evaluation Metric
compute_evaluation_metric(lgbc_ros, X_test_imputed, y_test, y_pred_lgbcros, y_prob_pred_lgbcros)
 Accuracy Score :  0.8851503144015083

 AUC Score :  0.9252961635759673

 Confusion Matrix : 
 [[151857  19184]
 [  1163   4958]]

 Classification Report : 
               precision    recall  f1-score   support

       False       0.99      0.89      0.94    171041
        True       0.21      0.81      0.33      6121

    accuracy                           0.89    177162
   macro avg       0.60      0.85      0.63    177162
weighted avg       0.97      0.89      0.92    177162


 Concordance Index :  0.9252961416072233

 ROC curve : 

 PR curve : 

In [64]:
# Gains Table and Capture rates plot
captures(y_test, y_pred_lgbcros, y_prob_pred_lgbcros)
Out[64]:
prob_bin not_fraud fraud perc_fraud perc_not_fraud cum_perc_fraud cum_perc_not_fraud
0 Bin1((0.587, 0.998]) 13072 4645 0.758863 0.076426 75.886293 7.642612
1 Bin2((0.386, 0.587]) 17080 636 0.103905 0.099859 86.276752 17.628522
2 Bin3((0.269, 0.386]) 17398 318 0.051952 0.101718 91.471982 27.800352
3 Bin4((0.204, 0.269]) 17520 196 0.032021 0.102432 94.674073 38.043510
4 Bin5((0.16, 0.204]) 17593 123 0.020095 0.102858 96.683548 48.329348
5 Bin6((0.126, 0.16]) 17650 66 0.010783 0.103192 97.761804 58.648511
6 Bin7((0.0985, 0.126]) 17659 57 0.009312 0.103244 98.693024 68.972936
7 Bin8((0.0746, 0.0985]) 17681 35 0.005718 0.103373 99.264826 79.310224
8 Bin9((0.0519, 0.0746]) 17686 30 0.004901 0.103402 99.754942 89.650435
9 Bin10((0.0035399999999999997, 0.0519]) 17702 15 0.002451 0.103496 100.000000 100.000000
In [65]:
draw_calibration_curve(y_test, y_prob_pred_lgbcros, n_bins=10)

Inferences:

  • After balancing the class, accuracy score is 0.88 and AUC score is 92.5%
  • Accuracy has decreased as compared to the previos model, but AUC has improved
  • Additionally the recall has improved significantly at the cost of precision.

18. Cost Sensitive Learning with Class weights

The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

In [103]:
%%time
# Define the model
lgbc_bal = LGBMClassifier(random_state=0, class_weight='balanced')

# Train the model
lgbc_bal.fit(X_train_imputed, y_train)

lgbc_bal
Wall time: 1min 30s
Out[103]:
LGBMClassifier(class_weight='balanced', random_state=0)
In [104]:
# Prediction
y_pred_lgbcbal = lgbc_bal.predict(X_test)
y_prob_pred_lgbcbal = lgbc_bal.predict_proba(X_test)[:, 1]

print("Y predicted : ",y_pred_lgbcbal)
print("Y probability predicted : ",y_prob_pred_lgbcbal[:5])
Y predicted :  [False False False ... False False  True]
Y probability predicted :  [0.17759182 0.43730851 0.42168771 0.41189862 0.17474149]
In [105]:
# Compute Evaluation Metric
compute_evaluation_metric(lgbc_bal, X_test, y_test, y_pred_lgbcbal, y_prob_pred_lgbcbal)
 Accuracy Score :  0.7243144692428399

 AUC Score :  0.9084094113408069

 Confusion Matrix : 
 [[122883  48158]
 [   683   5438]]

 Classification Report : 
               precision    recall  f1-score   support

       False       0.99      0.72      0.83    171041
        True       0.10      0.89      0.18      6121

    accuracy                           0.72    177162
   macro avg       0.55      0.80      0.51    177162
weighted avg       0.96      0.72      0.81    177162


 Concordance Index :  0.9084093946254581

 ROC curve : 

 PR curve : 

In [108]:
# Gains Table and Capture rates plot
captures(y_test, y_pred_lgbcbal, y_prob_pred_lgbcbal)
Out[108]:
prob_bin not_fraud fraud perc_fraud perc_not_fraud cum_perc_fraud cum_perc_not_fraud
0 Bin1((0.71, 0.998]) 13398 4319 0.705604 0.078332 70.560366 7.833210
1 Bin2((0.592, 0.71]) 16984 732 0.119588 0.099298 82.519196 17.762992
2 Bin3((0.502, 0.592]) 17337 379 0.061918 0.101362 88.710995 27.899159
3 Bin4((0.42, 0.502]) 17452 263 0.042967 0.102034 93.007678 38.102560
4 Bin5((0.345, 0.42]) 17566 151 0.024669 0.102701 95.474596 48.372612
5 Bin6((0.276, 0.345]) 17603 113 0.018461 0.102917 97.320699 58.664297
6 Bin7((0.212, 0.276]) 17643 73 0.011926 0.103151 98.513315 68.979368
7 Bin8((0.155, 0.212]) 17674 42 0.006862 0.103332 99.199477 79.312562
8 Bin9((0.102, 0.155]) 17680 36 0.005881 0.103367 99.787616 89.649265
9 Bin10((0.00214, 0.102]) 17704 13 0.002124 0.103507 100.000000 100.000000
In [109]:
draw_calibration_curve(y_test, y_prob_pred_lgbcbal, n_bins=10)

19. Model Calibration

In [22]:
from sklearn.calibration import CalibratedClassifierCV
In [72]:
lgbc_bal = LGBMClassifier(random_state=0)
calibrated_clf = CalibratedClassifierCV(base_estimator=lgbc_bal, cv=3, method='sigmoid')
calibrated_clf.fit(X_train_imputed, y_train)
Out[72]:
CalibratedClassifierCV(base_estimator=LGBMClassifier(random_state=0), cv=3)
In [73]:
# Prediction
y_pred_calib = calibrated_clf.predict(X_test)
y_prob_pred_calib = calibrated_clf.predict_proba(X_test)[:, 1]
In [74]:
len(calibrated_clf.calibrated_classifiers_)
Out[74]:
3
In [75]:
print("Y predicted : ", y_pred_calib)
print("Y probability predicted : ", y_prob_pred_calib[:5])
Y predicted :  [False False False ... False False  True]
Y probability predicted :  [0.01690884 0.01522014 0.017496   0.01598323 0.01401708]
In [76]:
# Compute Evaluation Metric
compute_evaluation_metric(calibrated_clf, X_test_imputed, y_test, y_pred_calib, y_prob_pred_calib)
 Accuracy Score :  0.976259017170725

 AUC Score :  0.9181559291804907

 Confusion Matrix : 
 [[170242    799]
 [  3407   2714]]

 Classification Report : 
               precision    recall  f1-score   support

       False       0.98      1.00      0.99    171041
        True       0.77      0.44      0.56      6121

    accuracy                           0.98    177162
   macro avg       0.88      0.72      0.78    177162
weighted avg       0.97      0.98      0.97    177162


 Concordance Index :  0.9181559244046767

 ROC curve : 

 PR curve : 

In [77]:
# Gains Table and Capture rates plot
captures(y_test, y_pred_calib, y_prob_pred_calib)
Out[77]:
prob_bin not_fraud fraud perc_fraud perc_not_fraud cum_perc_fraud cum_perc_not_fraud
0 Bin1((0.0513, 0.998]) 13138 4579 0.748080 0.076812 74.808038 7.681199
1 Bin2((0.0278, 0.0513]) 17093 623 0.101781 0.099935 84.986113 17.674710
2 Bin3((0.0217, 0.0278]) 17408 308 0.050319 0.101777 90.017971 27.852386
3 Bin4((0.0189, 0.0217]) 17490 226 0.036922 0.102256 93.710178 38.078005
4 Bin5((0.0172, 0.0189]) 17598 118 0.019278 0.102888 95.637968 48.366766
5 Bin6((0.016, 0.0172]) 17629 87 0.014213 0.103069 97.059304 58.673651
6 Bin7((0.0152, 0.016]) 17649 67 0.010946 0.103186 98.153896 68.992230
7 Bin8((0.0145, 0.0152]) 17664 52 0.008495 0.103273 99.003431 79.319578
8 Bin9((0.0138, 0.0145]) 17678 38 0.006208 0.103355 99.624244 89.655112
9 Bin10((0.011800000000000001, 0.0138]) 17694 23 0.003758 0.103449 100.000000 100.000000
In [78]:
draw_calibration_curve(y_test, y_prob_pred_calib, n_bins=10)

20. Model Tuning

Hyperparameter is a parameter that governs how the algorithm trains to learn the relationships. The values are set before the learning process begins.

Hyperparameter tuning refers to the automatic optimization of the hyper-parameters of a ML model.

In [10]:
%%time
# Define the estimator
lgbmclassifier = LGBMClassifier(random_state=0)

# Define the parameters gird
param_grid = {
    'n_estimator'   : [100,200],            # default: 100
    'num_leaves'    : [256,128],            # default: 256
    'max_depth'     : [5, 8],               # default: 8 
    'learning_rate' : [0.05, 0.1],          # default: .1
    'reg_alpha'     : [0 .1, 0.5],          # default: .5
    'class_weight'  : ['balanced', None],
}

# run grid search
grid = GridSearchCV(lgbmclassifier, param_grid=param_grid, refit = True, verbose = 3, n_jobs=-1,cv = 3)
  
# fit the model for grid search 
grid.fit(X_train, y_train)
Fitting 3 folds for each of 64 candidates, totalling 192 fits
Wall time: 2h 27min 26s
Out[10]:
GridSearchCV(cv=3, estimator=LGBMClassifier(random_state=0), n_jobs=-1,
             param_grid={'class_weight': ['balanced', None],
                         'learning_rate': [0.05, 0.1], 'max_depth': [5, 8],
                         'n_estimator': [100, 200], 'num_leaves': [256, 128],
                         'reg_alpha': [0.1, 0.5]},
             verbose=3)

Get the best parameters corresponding to which you have best model

In [11]:
# Best parameter after hyper parameter tuning 
print(grid.best_params_) 
  
# Moel Parameters 
print(grid.best_estimator_)

lgbmclassifier = grid.best_estimator_
{'class_weight': None, 'learning_rate': 0.1, 'max_depth': 8, 'n_estimator': 100, 'num_leaves': 256, 'reg_alpha': 0.5}
LGBMClassifier(max_depth=8, n_estimator=100, num_leaves=256, random_state=0,
               reg_alpha=0.5)

Let's use the best model to get predictions on test dataset. We would be looking at the predicted class and predicted probability both in order to evaluate the performance of the model

In [12]:
# Prediction using best parameters
y_grid_pred = lgbmclassifier.predict(X_test)
y_prob_grid_pred = lgbmclassifier.predict_proba(X_test)[:, 1]
print("Y predicted : ",y_grid_pred)
print("Y probability predicted : ",y_prob_grid_pred[:5])
Y predicted :  [False False False ... False False False]
Y probability predicted :  [0.00210893 0.02470222 0.01490586 0.00864042 0.00098105]

Evaluation Metrics

Let's compute various evaluation metrices now

  • Accuracy Score
  • Confusion Matrix
  • Classification Report
  • AUC Score
  • Concodense Index
  • ROC curve
  • PR curve
In [19]:
# Compute Evaluation Metric
compute_evaluation_metric(lgbmclassifier, X_test, y_test, y_grid_pred, y_prob_grid_pred)
 Accuracy Score :  0.9797755726397309

 AUC Score :  0.9409700338679997

 Confusion Matrix : 
 [[170734    307]
 [  3276   2845]]

 Classification Report : 
               precision    recall  f1-score   support

       False       0.98      1.00      0.99    171041
        True       0.90      0.46      0.61      6121

    accuracy                           0.98    177162
   macro avg       0.94      0.73      0.80    177162
weighted avg       0.98      0.98      0.98    177162


 Concordance Index :  0.9409700247939532

 ROC curve : 

 PR curve : 

Calibration Curve

In [20]:
draw_calibration_curve(y_test, y_prob_grid_pred, n_bins=10)

Calibrate the model

In [24]:
# Calibrate
calibrated_clf = CalibratedClassifierCV(base_estimator=lgbmclassifier, cv=3)
calibrated_clf.fit(X_train, y_train)
y_pred_calib = calibrated_clf.predict(X_test)
y_prob_pred_calib = calibrated_clf.predict_proba(X_test)[:, 1]
In [26]:
draw_calibration_curve(y_test, y_prob_pred_calib, n_bins=10)
In [25]:
# Compute Evaluation Metric
compute_evaluation_metric(calibrated_clf, X_test, y_test, y_pred_calib, y_prob_pred_calib)
 Accuracy Score :  0.9801763357830686

 AUC Score :  0.9425583129340254

 Confusion Matrix : 
 [[170598    443]
 [  3069   3052]]

 Classification Report : 
               precision    recall  f1-score   support

       False       0.98      1.00      0.99    171041
        True       0.87      0.50      0.63      6121

    accuracy                           0.98    177162
   macro avg       0.93      0.75      0.81    177162
weighted avg       0.98      0.98      0.98    177162


 Concordance Index :  0.9425583124564438

 ROC curve : 

 PR curve : 

In [27]:
# Gains Table and Capture rates plot
captures(y_test, y_pred_calib, y_prob_pred_calib)
Out[27]:
prob_bin not_fraud fraud perc_fraud perc_not_fraud cum_perc_fraud cum_perc_not_fraud
0 Bin1((0.0194, 0.999]) 12683 5034 0.822415 0.074152 82.241464 7.415181
1 Bin2((0.0146, 0.0194]) 17276 440 0.071884 0.101005 89.429832 17.515683
2 Bin3((0.0134, 0.0146]) 17484 232 0.037902 0.102221 93.220062 27.737794
3 Bin4((0.0128, 0.0134]) 17576 140 0.022872 0.102759 95.507270 38.013693
4 Bin5((0.0125, 0.0128]) 17620 96 0.015684 0.103016 97.075641 48.315316
5 Bin6((0.0122, 0.0125]) 17654 62 0.010129 0.103215 98.088548 58.636818
6 Bin7((0.012, 0.0122]) 17675 41 0.006698 0.103338 98.758373 68.970598
7 Bin8((0.0119, 0.012]) 17681 35 0.005718 0.103373 99.330175 79.307885
8 Bin9((0.0117, 0.0119]) 17686 30 0.004901 0.103402 99.820291 89.648096
9 Bin10((0.010499999999999999, 0.0117]) 17706 11 0.001797 0.103519 100.000000 100.000000

Inferences:

  • Accuracy score is 0.91
  • AUC score and Concordance index are 0.97, which are the best so far
  • Classfication report is also balanced between both the classes
  • ROC curve and PR are the best so far

Hence we can freeze the model.

21. Feature Importance

Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.

In [28]:
lgbmclassifier = grid.best_estimator_
In [29]:
lgbmclassifier.feature_importances_ 
Out[29]:
array([189,  60, 657, 568, 106,  58, 251,  89, 488,  10, 261, 250, 119,
       222, 197,  17,  31,  63, 146,  28,  65, 103,  62, 171,  62, 298,
       130, 193, 267, 144, 232,  96,  43, 266,  95, 195, 123,  39,  60,
        94, 286,   0,  20,  29,  66,  61,  58,   8,   8,  16,  42,  81,
         2,   0,  38,  31,   8,   1,  10,   2,  35,  14,   2,   1,  11,
        86,  64,   0,   0,   8,   9,   4,  33,   0,   0,   2,   1,   2,
         0,  33,   0,   0,   0,   0,   0,   0,   0,   0,  22,  15,   1,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   3,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   2,   0,   0,   0,   0,   1,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   1,   0,   1,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   1,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   1,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
        94, 172, 343,  52,   8, 232, 189, 344, 167, 188, 161, 135, 118,
       114, 177, 100, 242, 188, 178, 169, 146, 185, 186, 199, 161, 167,
       167, 136, 155, 141, 194, 184, 125, 110, 149, 186, 194, 192, 200])
In [30]:
feature_importance_df = pd.DataFrame({'feature' : X_train.columns, 'importance' : lgbmclassifier.feature_importances_ })
In [31]:
feature_importance_df = feature_importance_df.sort_values(by="importance", ascending=False)
feature_importance_df = feature_importance_df.iloc[:30,:]
feature_importance_df
Out[31]:
feature importance
2 card1 657
3 card2 568
8 addr1 488
501 TransactionAmt_to_std_card1 344
496 _Days 343
25 C13 298
40 D15 286
28 D2 267
33 D8 266
10 dist1 261
6 card5 251
11 P_emaildomain 250
510 PCA_V_7 242
499 TransactionAmt_to_mean_card1 232
30 D4 232
13 C1 222
532 PCA_V_29 200
517 PCA_V_14 199
14 C2 197
35 D10 195
524 PCA_V_21 194
530 PCA_V_27 194
27 D1 193
531 PCA_V_28 192
500 TransactionAmt_to_mean_card4 189
0 TransactionAmt 189
503 PCA_V_0 188
511 PCA_V_8 188
516 PCA_V_13 186
529 PCA_V_26 186
In [32]:
plt.figure(figsize=(16, 12));
sns.barplot(x="importance", y="feature", data=feature_importance_df.sort_values(by="importance", ascending=False));
plt.title('LGB Features');

Inferences:

  • card1 is contributing the most in predicting if a transaction is fraud or not
  • card2, addr1, C13, P_emaildomain, C1 etc are some of the most important features in predicting the fraud
  • Certain card types, addresses and emails are at high risk of fraud, so there is a need to monitor these carefully

22. Partial Dependence and Individual Conditional Expectations (ICE)

In [33]:
## pdp plots
from sklearn.inspection import partial_dependence, plot_partial_dependence
from sklearn.utils import validation

Fit the model

In [36]:
lgbmclassifier.fit(X_train, y_train)
lgbmclassifier.dummy_ = "dummy"

validation.check_is_fitted(estimator=lgbmclassifier)

Plot Partial Dependence

In [39]:
fig = plt.figure(figsize=(16, 12))
plot_partial_dependence(lgbmclassifier, X, ['card2'])
plt.show()
<Figure size 1152x864 with 0 Axes>

Individual Conditional Expectation (ICE) Plot - card2

In [133]:
plot_partial_dependence(lgbmclassifier, X, ['card2'], kind='both')
Out[133]:
<sklearn.inspection._plot.partial_dependence.PartialDependenceDisplay at 0x11225b5bd68>

Partial Dependence and ICE Plot - C13

In [132]:
fig = plt.figure(figsize=(16, 12))
plot_partial_dependence(lgbmclassifier, X, ['C13'], kind='both')
plt.show()
<Figure size 1152x864 with 0 Axes>
In [41]:
fig = plt.figure(figsize=(16, 12))
plot_partial_dependence(lgbmclassifier, X, ['C13'])
plt.show()
<Figure size 1152x864 with 0 Axes>

23. SHAP Values

SHAP values is used to reverse engineer the output of the prediction model and quantify the contribution of each predictor for a given prediction.

In [125]:
import shap
shap_model = shap.TreeExplainer(lgbmclassifier)
shap_values = shap_model.shap_values(X_train)

You can make a partial dependence plot using shap.dependence_plot. This shows the relationship between the feature and the Y. This also automatically includes another feature that your feature interacts frequently with.

In [55]:
# card2
shap.dependence_plot("card2", shap_values[0], X_train)
In [56]:
# card3
shap.dependence_plot("card3", shap_values[0], X_train)

Explain a single observation.

In [127]:
shap.initjs()  # needed to show viz
shap.force_plot(shap_model.expected_value[1], shap_values[1][14], X_train.iloc[14, :])
Out[127]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

Add link = "logit"

In [129]:
shap.initjs()  # needed to show viz
shap.force_plot(shap_model.expected_value[1], shap_values[1][14], X_train.iloc[14, :], link='logit')
Out[129]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [104]:
y_pred_calib_tr[14]
Out[104]:
array([0.89570106, 0.10429894])
In [128]:
# compute SHAP values
explainer = shap.Explainer(lgbmclassifier, X_train) # , link=shap.links.logit)
shap_values_waterfall = explainer(X_train[:100])

# visualize the first prediction's explanation
shap.plots.waterfall(shap_values_waterfall[0])

Conclusion

The model has been trained and tested, so now one can use it to predict if any transaction would be fraud or not.

In [ ]:
 
In [ ]:
 
In [ ]: